A guide to utilizing ChIPpeakAnno (Lihua J. Zhu et al. 2010; Lihua Julie Zhu 2013) for common analysis of peak calling data identified with the classic ChIP-seq or the cutting-edge CUT&RUN experiments and their variants. ChIPpeakAnno offers a range of functions for peak annotaion (the process of mapping peaks to the nearest genomic features such as gene and miRNA), visualization, and peak profile comparison.
ChIPpeakAnno 3.35.3
Several experimental techniques including ChIP-ChIP (Blat and Kleckner 1999), ChIP-seq (Barski et al. 2007; Johnson et al. 2007), CUT&RUN (Skene and Henikoff 2017) have gained widespread use in molecular biology. These techniques enable the study of protein-DNA interactions by identifying the genomic regions that are associated with specific proteins, such as transcription factor binding sites, or epigenetic markers like histone modification. Through these approaches, researchers gain valuable insights into the regulation of genes and the structure of chromatin.
The analysis of experimental data involves several key steps, which are outlined below. The steps where the ChIPpeakAnno package can be utilized are displayed in bold.
Various algorithms have been developed and published to identify peaks from experimental data (Thomas et al. 2017). Once the peak files are obtained, they are commonly converted into formats like BED or bigWig. These formats allow the data to be easily loaded into genome browsers, such as the UCSC Genome Browser, as custom tracks. This enables investigators to examine the proximity of the peaks to different genomic features like genes, exons, or other conserved elements. However manually navigating through the genome browser can be a daunting task due to the typically large number of peaks that are spread across the genome.
The Bioconductor package ChIPpeakAnno is a pioneering tool that facilitates the batch annotation of peaks obtained from various technologies that result in a large number of enriched genomic regions. One notable feature of ChIPpeakAnno is its ability to dynamically retrieve up-to-date annotations by leveraging the biomRt package. This ensures that users have access to the most current annotation information. Additionally, users have the flexibility to provide their own annotation data or utilizing existing annotation packages. Furthermore, ChIPpeakAnno provides functions to retrieve sequences surrounding the peaks, which can be utilized for peak validation and motif discovery (Durinck et al. 2005). The package also facilitates the identification of enriched GO or pathway terms associated with the peaks. Moreover, ChIPpeakAnno includes several helper functions that aid in visualizing binding patterns and comparing peak profiles across multiple samples or experimental conditions.
ChIPpeakAnno has been continuously enhanced since its initial release in 2010, as evident from its active development1 https://github.com/jianhong/ChIPpeakAnno/blob/devel/NEWS. New features are regularly added based on user requests (Section ??). If you have any questions regarding its usage, please search for answers or post new questions on the Bioconductor Support Site. Additionally, if you have any suggestions for novel features, encounter any bugs, or wish to contribute in any way, please feel free to raise an issue or submit a pull request on the ChIPpeakAnno GitHub repository. Your input and contributions will be greatly appreciated and carefully considered.
This section provides an example on utilizing ChIPpeakAnno to annotate peaks identified with MACS or MACS2 (a popular peak calling tool) (Zhang et al. 2008).
The process of annotating peak data typically involves several steps, each of which will be demonstrated individually in the following sections:
annotatePeakInBatch (Section @ref(prepanno?))addGeneIDs (Section @ref(evalcon?))In Section ?? to ??, we delve into detailed use cases, showcasing the versatility of ChIPpeakAnno in various scenarios. Section ?? to ?? present a series of commonly employed analytical pipelines, offering step-by-step illustrations for different settings.
The ChIPpeakAnno package provides an example peak file in xls format, which is generated by MACS. To work with this example file, we need to locate it with system.file and convert it into a GRanges object using toGRanges function.
library(ChIPpeakAnno)
macs_peak <- system.file("extdata", "MACS_peaks.xls",
package = "ChIPpeakAnno")
macs_peak_gr2 <- toGRanges(macs_peak, format = "MACS")
head(macs_peak_gr2, n = 2)
## GRanges object with 2 ranges and 6 metadata columns:
## seqnames ranges strand | length summit tags
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## X01 chr1 25323511-25324015 * | 505 252 45
## X02 chr1 25362685-25362997 * | 313 211 33
## -10*log10(pvalue) fold_enrichment FDR
## <numeric> <numeric> <numeric>
## X01 59.17 17.01 5.8
## X02 60.63 22.41 4.2
## -------
## seqinfo: 12 sequences from an unspecified genome; no seqlengths
The section presents annotations from two sources: the Ensembl annotation package and the UCSC annotation package. For more information on annotation file preparation, please refer to the Section 4.
library(EnsDb.Hsapiens.v86)
ensembl.hs86.gene <- genes(EnsDb.Hsapiens.v86)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
ucsc.hg38.knownGene <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
macs_peak_ensembl <- annotatePeakInBatch(macs_peak_gr2,
AnnotationData = ensembl.hs86.gene)
head(macs_peak_ensembl, n = 2)
## GRanges object with 2 ranges and 15 metadata columns:
## seqnames ranges strand | length summit
## <Rle> <IRanges> <Rle> | <integer> <integer>
## X01.ENSG00000259984 chr1 25323511-25324015 * | 505 252
## X02.ENSG00000117616 chr1 25362685-25362997 * | 313 211
## tags -10*log10(pvalue) fold_enrichment FDR
## <integer> <numeric> <numeric> <numeric>
## X01.ENSG00000259984 45 59.17 17.01 5.8
## X02.ENSG00000117616 33 60.63 22.41 4.2
## peak feature start_position end_position
## <character> <character> <integer> <integer>
## X01.ENSG00000259984 X01 ENSG00000259984 25336429 25337465
## X02.ENSG00000117616 X02 ENSG00000117616 25242237 25338213
## feature_strand insideFeature distancetoFeature
## <character> <character> <numeric>
## X01.ENSG00000259984 - downstream 13954
## X02.ENSG00000117616 - upstream -24472
## shortestDistance fromOverlappingOrNearest
## <integer> <character>
## X01.ENSG00000259984 12414 NearestLocation
## X02.ENSG00000117616 24472 NearestLocation
## -------
## seqinfo: 12 sequences from an unspecified genome; no seqlengths
macs_peak_ucsc <- annotatePeakInBatch(macs_peak_gr2,
AnnotationData = ucsc.hg38.knownGene)
head(macs_peak_ucsc, n = 2)
## GRanges object with 2 ranges and 15 metadata columns:
## seqnames ranges strand | length summit tags
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## X01.57035 chr1 25323511-25324015 * | 505 252 45
## X02.23585 chr1 25362685-25362997 * | 313 211 33
## -10*log10(pvalue) fold_enrichment FDR peak feature
## <numeric> <numeric> <numeric> <character> <character>
## X01.57035 59.17 17.01 5.8 X01 57035
## X02.23585 60.63 22.41 4.2 X02 23585
## start_position end_position feature_strand insideFeature
## <integer> <integer> <character> <character>
## X01.57035 25242249 25338213 - inside
## X02.23585 25338317 25362361 + downstream
## distancetoFeature shortestDistance fromOverlappingOrNearest
## <numeric> <integer> <character>
## X01.57035 14702 14198 NearestLocation
## X02.23585 24368 324 NearestLocation
## -------
## seqinfo: 12 sequences from an unspecified genome; no seqlengths
For detailed information about the metadata columns in the result files, please refer to the Section ?? or use the ?annotatePeakInBatch command. Of note, the annotated peaks do not include the gene symbol by default. However, we can easily add this piece of information using the addGeneIDs function, as illustrated in the examples below.
To map gene symbols to the annotated feature IDs, you will need to load the organism annotation package org.Hs.eg.db as it contains the necessary information. Once loaded, you can use theaddGeneIDs function and specify ID2Add = "symbol" to add the gene symbols.
library(org.Hs.eg.db)
macs_peak_ensembl <- addGeneIDs(annotatedPeak = macs_peak_ensembl,
orgAnn = "org.Hs.eg.db", IDs2Add = "symbol",
feature_id_type = "ensembl_gene_id")
head(macs_peak_ensembl, n = 2)
## GRanges object with 2 ranges and 16 metadata columns:
## seqnames ranges strand | length summit
## <Rle> <IRanges> <Rle> | <integer> <integer>
## X01.ENSG00000259984 chr1 25323511-25324015 * | 505 252
## X02.ENSG00000117616 chr1 25362685-25362997 * | 313 211
## tags -10*log10(pvalue) fold_enrichment FDR
## <integer> <numeric> <numeric> <numeric>
## X01.ENSG00000259984 45 59.17 17.01 5.8
## X02.ENSG00000117616 33 60.63 22.41 4.2
## peak feature start_position end_position
## <character> <character> <integer> <integer>
## X01.ENSG00000259984 X01 ENSG00000259984 25336429 25337465
## X02.ENSG00000117616 X02 ENSG00000117616 25242237 25338213
## feature_strand insideFeature distancetoFeature
## <character> <character> <numeric>
## X01.ENSG00000259984 - downstream 13954
## X02.ENSG00000117616 - upstream -24472
## shortestDistance fromOverlappingOrNearest symbol
## <integer> <character> <character>
## X01.ENSG00000259984 12414 NearestLocation <NA>
## X02.ENSG00000117616 24472 NearestLocation RSRP1
## -------
## seqinfo: 12 sequences from an unspecified genome; no seqlengths
macs_peak_ucsc <- addGeneIDs(annotatedPeak = macs_peak_ucsc,
orgAnn = "org.Hs.eg.db", IDs2Add="symbol",
feature_id_type = "entrez_id")
head(macs_peak_ucsc, n = 2)
## GRanges object with 2 ranges and 16 metadata columns:
## seqnames ranges strand | length summit tags
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## X01.57035 chr1 25323511-25324015 * | 505 252 45
## X02.23585 chr1 25362685-25362997 * | 313 211 33
## -10*log10(pvalue) fold_enrichment FDR peak feature
## <numeric> <numeric> <numeric> <character> <character>
## X01.57035 59.17 17.01 5.8 X01 57035
## X02.23585 60.63 22.41 4.2 X02 23585
## start_position end_position feature_strand insideFeature
## <integer> <integer> <character> <character>
## X01.57035 25242249 25338213 - inside
## X02.23585 25338317 25362361 + downstream
## distancetoFeature shortestDistance fromOverlappingOrNearest
## <numeric> <integer> <character>
## X01.57035 14702 14198 NearestLocation
## X02.23585 24368 324 NearestLocation
## symbol
## <character>
## X01.57035 RSRP1
## X02.23585 TMEM50A
## -------
## seqinfo: 12 sequences from an unspecified genome; no seqlengths
As observed, a symbol metadata column has been successfully added to the resulting GRanges object. It is crucial to ensure the correct selection of feature_id_type for the function to work properly. By default, feature_id_type = "ensembl_gene_id".
GRangesThe peak files are represented as GRanges objects in ChIPpeakAnno and various other packages. To facilitate the conversion of commonly used peak formats including BED, GFF, as well as other popular formats like MACS, MACS2 and CSV, we have implemented a helper function called toGRanges. You can type ?toGRanges to view the complete list of supported formats.
### Convert GFF to GRanges
macs_peak_gff <- system.file("extdata", "GFF_peaks_hg38.gff",
package = "ChIPpeakAnno")
macs_peak_gr1 <- toGRanges(macs_peak_gff, format = "GFF", header = FALSE)
head(macs_peak_gr1, n = 2)
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## [1] chr1 778513-779367 + | bed2gff NA 1 <NA>
## [2] chr1 779643-780198 + | bed2gff NA 1 <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
BED to GRangesmacs_peak_bed <- system.file("extdata", "MACS_output_hg38.bed",
package = "ChIPpeakAnno")
macs_peak_gr2 <- toGRanges(macs_peak_bed, format = "BED", header = FALSE)
head(macs_peak_gr2, n = 2)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## MACS_peak_1 chr1 28341-29610 * | 1
## MACS_peak_2 chr1 90821-91234 * | 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Concordant peaks are those that are consistently present in the same genomic regions across replicates. These peaks exhibit a high degree of overlap or similarity in their genomic coordinates and signal intensity. The presence of concordant peaks indicates that the observed signal is likely to be true positive rather than arising from random noise or technical artifacts.
For experiments with two or more biological replicates, there are two strategies to handle the peaks identified from each replicate.
Investigators may choose one of these strategies based on their research questions and the quality of their data. You can obtain concordant peaks and quantitatively evaluate the significance of peak overlaps using functions provided by ChIPpeakAnno.
Here is a sample code that demonstrates how to obtain overlapping peaks for two peak sets. In this example, peaks are considered overlapping if they are within a distance of 1000 base pairs (maxgap = 1000). The code utilized the data function to read in two demo peak sets in GRanges that are bundled with ChIPpeakAnno.
data(peaks1)
data(peaks2)
head(peaks1, n = 2)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## Site1 1 967654-967754 +
## Site2 2 2010897-2010997 +
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
head(peaks2, n = 2)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## t1 1 967659-967869 +
## t2 2 2010898-2011108 +
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
ol <- findOverlapsOfPeaks(peaks1, peaks2, maxgap = 1000)
names(ol)
## [1] "venn_cnt" "peaklist" "uniquePeaks"
## [4] "mergedPeaks" "peaksInMergedPeaks" "overlappingPeaks"
## [7] "all.peaks"
The findOverlapsOfPeaks function returns an object of class overlappingPeaks, which contains the following elements. These elements provide comprehensive information about the overlapping peaks and unique peaks, allowing for further visualization and interpretation of the data.
venn_cnt: an object of VennCounts, which can be used to draw a Venn diagrampeaklist: a list of GRanges objects that contains all merged overlapping peaks or unique peaksuniquePeaks: an object of GRanges that consists of all unique peaksmergedPeaks: an object of GRanges that consists of all merged overlapping peakspeaksInMergedPeaks: an object of GRanges that consists of all peaks in each samples involved in the overlapping peaksoverlappingPeaks: a list of data frames that consists of the annotation of all overlapping peaksall.peaks: a list of GRanges objects that consists the input peaks with formatted rownamesTo determine the number of peaks that are unique to the peaks1 set (i.e., not overlapping with any peaks in peaks2), you can use the following code:
length(ol$peaklist[["peaks1"]])
## [1] 0
Similarly for peaks2:
length(ol$peaklist[["peaks2"]])
## [1] 5
To obtain the merged overlapping peaks, you have two options:
ol$peaklist[["peaks1///peaks2"]], where "peaks1///peaks2" represents a GRanges object that contains all the merged overlapping peaks in peaks1 and peaks2ol$mergedPeaks to obtain all the merged overlapping peaks from all the peak sets in ol. This means that if you have multiple peak sets, this command will output all the peaks that overlap with peaks from any other peak sets.identical(ol$peaklist[["peaks1///peaks2"]], ol$mergedPeaks)
## [1] TRUE
The output from findOverlapsOfPeaks can be directly fed to makeVennDiagram to draw a Venn diagram and evaluate the degree of overlap between peak sets. Additionally, the makeVennDiagram function also provides a P-value, indicating whether the observed overlapping is statistically significant or not.
venn <- makeVennDiagram(ol, totalTest = 100,
fill = c("#009E73", "#F0E442"),
col = c("#D55E00", "#0072B2"),
cat.col = c("#D55E00", "#0072B2"))
Figure 1: Venn diagram showing the number of overlapping peaks for two samples
The parameter totalTest refers to the total number of potential peaks that are considered in the hypergeometric test. It should be set to a value larger than the largest number of peaks in the replicates. Refer to Section @ref(hyper_test) on how to choose totalTest.
In addition to hypergeometric test, an alternative function called peakPermTest has been implemented to circumvent the requirement of estimating the total potential binding sites for the given experiment. For more information about the peakPermTest, you can go to the Section (@ref(perm_test)).
Users can choose to utilize other tools to create Venn diagram by leveraging the ol$venn_cnt object. The code below illustrates how to employ the third-party R package Vennerable for plotting.
if (!library(Vennerable)) {
install.packages("Vennerable", repos="http://R-Forge.R-project.org")
library(Vennerable)
}
n <- which(colnames(ol$venn_cnt) == "Counts") - 1
set_names <- colnames(ol$venn_cnt)[1:n]
weight <- ol$venn_cnt[, "Counts"]
names(weight) <- apply(ol$venn_cnt[, 1:n], 1, base::paste, collapse = "")
v <- Venn(SetNames = set_names, Weight = weight)
plot(v)
As mentioned earlier, two peaks are considered as “overlapping” if their distances are within a maximum distance of 1000bp (maxgap = 1000). To visualize the distribution of the relative positions of peaks1 to peaks2, we can create a pie chart using the ol$overlappingPeaks object. This object is a list that contains detailed information about the relative positions of peaks for each pair of peak sets. For instance, ol$overlappingPeaks[["peaks1\\\peaks2"]] is a data frame that represents the relative positions of overlapping peaks between peaks1 and peaks2.
names(ol$overlappingPeaks)
## [1] "peaks1///peaks2"
dim(ol$overlappingPeaks[["peaks1///peaks2"]])
## [1] 13 14
ol$overlappingPeaks[["peaks1///peaks2"]][1:2, ]
## peaks1 seqnames start end width strand peaks2 seqnames start
## Site1_t1 Site1 1 967654 967754 101 + t1 1 967659
## Site2_t2 Site2 2 2010897 2010997 101 + t2 2 2010898
## end width strand overlapFeature shortestDistance
## Site1_t1 967869 211 + overlapStart 5
## Site2_t2 2011108 211 + overlapStart 1
unique(ol$overlappingPeaks[["peaks1///peaks2"]][["overlapFeature"]])
## [1] "overlapStart" "inside" "overlapEnd" "includeFeature"
## [5] "downstream"
The column overlapFeature describes the relative positions of peaks between peaks1 and peaks2.
upstream: the peak from peaks1 is located upstream of the peak from peaks2 and is within a distance of maxgapdownstream: the peak from peaks1 is located downstream of the peak from peaks2 and is within a distance of maxgapoverlapStart: the peak from peaks1 overlaps with the start of peak from peaks2overlapEnd: the peak from peaks1 overlaps with the end of peak from peaks2inside: the peak from peaks1 is completely contained within the peak from peaks2includeFeature: the peak from peaks1 contains the entire range of the peak from peaks2. It is important to note that the termFeature here refers to the peak in peaks2 rather than a genomic feature used in peak annotation contextThe utilization of a Venn diagram, in conjunction with a pie chart, enables a more comprehensive evaluation of peak concordance among biological replicates.
The function findOverlapsOfPeaks allows for the analysis of up to five sets of peaks. Here is an additional example for three samples.
data(peaks3)
ol2 <- findOverlapsOfPeaks(peaks1, peaks2, peaks3,
maxgap = 1000,
connectedPeaks = "min")
venn2 <- makeVennDiagram(ol2, totalTest = 100,
fill = c("#CC79A7", "#56B4E9", "#F0E442"),
col = c("#D55E00", "#0072B2", "#E69F00"),
cat.col = c("#D55E00", "#0072B2", "#E69F00"))
Figure 2: Venn diagram showing the number of overlapping peaks for three samples
chatgpt
An annotation file contains genomic coordinates and other relevant details for various genomic features, such as genes, transcripts, promoters, and enhancers. By overlapping the peaks with coordinates in the annotation file, researchers can determine which features are enriched or associated with the peaks. This helps in understanding the functional relevance of the peaks and provides insights into the potential regulatory elements or genes involved.
Popular annotation files come from two sources and are typically stored as tab-delimited format such as GTF and BED:
| Resource | Generated_by | Annotation_criteria | URL |
|---|---|---|---|
| Ensembl | EMBL-EBI | Comprehensive (most transcripts) | https://ftp.ensembl.org/pub/ |
| NCBI RefSeq | NCBI | Conservative (fewer transcripts) | https://ftp.ncbi.nlm.nih.gov/refseq/ |
In addition, UCSC Genome Browser provides processed annotations based on the above resources that can be visualized with the browser. For human and mouse, there are four GTF files provided respectively.
| Human | Mouse | Remark |
|---|---|---|
| hg38.ensGene.gtf.gz (outdated) | mm10.ensGene.gtf.gz | Based on Ensembl gene models?? |
| hg38.ncbiRefSeq.gtf.gz | mm10.knownGene.gtf.gz | Based on RefSeq transcripts as aligned by NCBI |
| hg38.knownGene.gtf.gz | mm10.ncbiRefSeq.gtf.gz | Based on GENCODE gene models (should it be same as Ensembl?) |
| hg38.refGene.gtf.gz | mm10.refGene.gtf.gz | Based on RefSeq transcripts aligned by UCSC followed by manual curation |
TxDb and EnsDbIn Bioconductor, the tab-delimited files are often converted into TxDb or EnsDb class to leverage the various accessor functions (e.g. genes, transcripts) provided with each class. An accessor function is designed to retrieve the desired features from the annotation file. Both TxDb and EnsDb can be created with the tab delimited files mentioned in the above though EnsDb is more tailored to Ensembl annotations and contains additional information such as gene_name, symbol, and gene_biotype compared to the TxDb counterparts converted with RefSeq or UCSC Genome Browser hosted annotations.
Meanwhile, there is a comprehensive list of pre-built Bioconductor maintained TxDb and EnsDb packages such as TxDb.Hsapiens.UCSC.hg38.knownGene and EnsDb.Hsapiens.v86, which are regularly updated and can be found here.
EnsDb and TxDb from AnnotationHubUsers can also retrieve annotations with the AnnotationHub package. By default, AnnotationHub uses a snapshot that is matching the Bioconductor that you are using, thus, might be slightly up-to-date compared to the pre-built packages. And you can switch to an earlier version2 https://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html#configuring-annotationhub-objects. Below are examples on how to obtain annotations with AnnotationHub.
EnsDb using AnnotationHublibrary(AnnotationHub)
ah <- AnnotationHub()
EnsDb_Mmusculus_all <- query(ah, pattern = c("Mus musculus", "EnsDb"))
head(EnsDb_Mmusculus_all, n = 2)
## AnnotationHub with 2 records
## # snapshotDate(): 2023-10-20
## # $dataprovider: Ensembl
## # $species: Mus musculus
## # $rdataclass: EnsDb
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH53222"]]'
##
## title
## AH53222 | Ensembl 87 EnsDb for Mus Musculus
## AH53726 | Ensembl 88 EnsDb for Mus Musculus
EnsDb_Mmusculus <- EnsDb_Mmusculus_all[["AH53222"]]
class(EnsDb_Mmusculus)
## [1] "EnsDb"
## attr(,"package")
## [1] "ensembldb"
TxDb using AnnotationHubTxDb_Mmusculus_all <- query(ah, pattern = c("Mus musculus", "TxDb"))
head(TxDb_Mmusculus_all, n = 2)
## AnnotationHub with 2 records
## # snapshotDate(): 2023-10-20
## # $dataprovider: UCSC
## # $species: Mus musculus
## # $rdataclass: TxDb
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH52263"]]'
##
## title
## AH52263 | TxDb.Mmusculus.UCSC.mm10.ensGene.sqlite
## AH52264 | TxDb.Mmusculus.UCSC.mm10.knownGene.sqlite
TxDb_Mmusculus <- TxDb_Mmusculus_all[["AH52264"]]
class(TxDb_Mmusculus)
## [1] "TxDb"
## attr(,"package")
## [1] "GenomicFeatures"
EnsDb and TxDbTo create the most up-to-date or custom TxDb and EnsDb objects, you can leverage the following functions from the GenomicFeatures3 https://bioconductor.org/packages/devel/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.html and ensembldb4 https://www.bioconductor.org/packages/devel/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#102_Building_annotation_packages.
GenomicFeatures::makeTxDbFromUCSCGenomicFeatures::makeTxDbFromBiomart (being phased out in favor of makeTxDbFromEnsembl)GenomicFeatures::makeTxDbFromEnsemblGenomicFeatures::makeTxDbFromGFFGenomicFeatures::makeTxDbFromGRangesensembldb::makeEnsemblSQLiteFromTablesensembldb::ensDbFromGtfensembldb::ensDbFromGffensembldb::ensDbFromGRangesbiomaRtThe biomaRt package provides an interface to the BioMart databases that are prominently maintained by Ensembl. By querying via biomaRt, you can access the most up-to-date annotations that are available from Ensembl. Check this vignette for details.
ChIPpeakAnno provides an helper function called getAnnotation for easy retrieval of desired annotations by leveraging biomaRt. Below is an example.
library(biomaRt)
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 110
## 2 ENSEMBL_MART_MOUSE Mouse strains 110
## 3 ENSEMBL_MART_SNP Ensembl Variation 110
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 110
head(listDatasets(useMart("ENSEMBL_MART_ENSEMBL")), n = 2)
## dataset description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2)
## version
## 1 ASM259213v1
## 2 fAstCal1.2
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
dataset="mmusculus_gene_ensembl")
anno_from_biomart <- getAnnotation(mart, featureType = "transcript")
head(anno_from_biomart, n = 2)
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | description
## <Rle> <IRanges> <Rle> | <character>
## ENSMUST00000082387 MT 1-68 + | mitochondrially enco..
## ENSMUST00000179436 JH584295.1 66-1479 - | <NA>
## ensembl_gene_id
## <character>
## ENSMUST00000082387 ENSMUSG00000064336
## ENSMUST00000179436 ENSMUSG00000095742
## -------
## seqinfo: 39 sequences from an unspecified genome; no seqlengths
Type ?getAnnotation for a full list of supported featureType.
GRangesTo annotate the peaks, the selected annotation file must be converted into GRanges class first. ChIPpeakAnno provides a helper function called toGRanges that can convert annotations from various formats such as GFF, BED, CSV, TxDb, and EnsDb into GRanges. Type ?toGRanges to learn more. Below are few examples.
transcript from TxDb/EnsDblibrary(EnsDb.Hsapiens.v86)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
anno_ensdb_transcript <- toGRanges(EnsDb.Hsapiens.v86,
feature = "transcript")
anno_txdb_transcript <- toGRanges(TxDb.Hsapiens.UCSC.hg38.knownGene,
feature = "transcript")
head(anno_ensdb_transcript, n = 2)
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | tx_id gene_id
## <Rle> <IRanges> <Rle> | <character> <character>
## ENST00000456328 chr1 11869-14409 + | ENST00000456328 ENSG00000223972
## ENST00000450305 chr1 12010-13670 + | ENST00000450305 ENSG00000223972
## gene_name
## <character>
## ENST00000456328 DDX11L1
## ENST00000450305 DDX11L1
## -------
## seqinfo: 357 sequences (1 circular) from 2 genomes (hg38, GRCh38)
head(anno_txdb_transcript, n = 2)
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_name gene_id
## <Rle> <IRanges> <Rle> | <character> <CharacterList>
## 1 chr1 11869-14409 + | ENST00000456328.2 100287102
## 2 chr1 12010-13670 + | ENST00000450305.2 100287102
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
gene from TxDb/EnsDbanno_ensdb_gene <- toGRanges(EnsDb.Hsapiens.v86,
feature = "gene")
anno_txdb_gene <- toGRanges(TxDb.Hsapiens.UCSC.hg38.knownGene,
feature = "gene")
head(anno_ensdb_gene, n = 2)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 chr1 11869-14409 + | DDX11L1
## ENSG00000227232 chr1 14404-29570 - | WASH7P
## -------
## seqinfo: 357 sequences (1 circular) from 2 genomes (hg38, GRCh38)
head(anno_txdb_gene, n = 2)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr19 58345178-58362751 -
## 10 chr8 18386311-18401218 +
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
Again, if you are working with TxDb or EnsDb, you can obtain features in GRanges format using the accessor functions.
anno_ensdb_gene <- genes(EnsDb.Hsapiens.v86)
anno_ensdb_transcript <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
head(anno_ensdb_gene, n = 2)
## GRanges object with 2 ranges and 6 metadata columns:
## seqnames ranges strand | gene_id gene_name
## <Rle> <IRanges> <Rle> | <character> <character>
## ENSG00000223972 1 11869-14409 + | ENSG00000223972 DDX11L1
## ENSG00000227232 1 14404-29570 - | ENSG00000227232 WASH7P
## gene_biotype seq_coord_system symbol
## <character> <character> <character>
## ENSG00000223972 transcribed_unproces.. chromosome DDX11L1
## ENSG00000227232 unprocessed_pseudogene chromosome WASH7P
## entrezid
## <list>
## ENSG00000223972 100287596,100287102,727856,...
## ENSG00000227232 <NA>
## -------
## seqinfo: 357 sequences (1 circular) from GRCh38 genome
head(anno_ensdb_transcript, n = 2)
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 11869-14409 + | 1 ENST00000456328.2
## [2] chr1 12010-13670 + | 2 ENST00000450305.2
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
getAnnotation functionThe output from getAnnotation are in GRanges, refer to Section 4.5 for details.
Plotting peak distributions helps in quality control by providing an overview of where the peaks are localized across the genome. Unexpected distributions suggest potential issues with the data. In addition, you can select appropriate annotation file depending on the distribution of your peaks. For instance, if peaks are enriched near promoters, you may focus on annotating them with nearby genes.
The function binOverFeature plots peak count distributions relative to a given genomic feature. The following example shows the distribution of peaks in macs_peak_gr2 relative to the gene feature (transcription start site (TSS)).
annotation_data <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
binOverFeature(macs_peak_gr2,
nbins = 20,
annotationData = annotation_data,
xlab = "peak distance from TSS (bp)",
ylab = "peak count",
main = "Distribution of aggregated peak numbers around TSS")
Figure 3: Peak count distribution around TSSs
By default, featureSite = "FeatureStart" meaning that the distance is calculated as peak to transcription start site (TSS). The above plot indicates that peaks are enriched around the TSS, characteristic of peaks obtained from transcription factor binding experiments.
You can also plot peak distribution over multiple genomic features including exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ in a single bar graph with assignChromosomeRegion.
chromosome_region <- assignChromosomeRegion(macs_peak_gr2,
TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene,
nucleotideLevel = FALSE,
precedence=c("Promoters",
"immediateDownstream",
"fiveUTRs",
"threeUTRs",
"Exons",
"Introns"))
barplot(chromosome_region[["percentage"]], las = 2)
Figure 4: Bar graph showing peak distributions over different genomic features
The argument las = 2 is to rotate the labels by 90 degree. By default, nucleotideLevel = FALSE meaning that peaks will be treated as ranges when determining overlaps with genomic features. If peak intersects with multiple features, the feature assignment is determined by the order specified in precedence, if precedence not set, counts for each overlapping features will be incremented. Otherwise, if nucleotideLevel = TRUE, the summit of the peak (single position) will be used when determining overlaps.
In addition to inspecting the peak enrichment pattern by plotting distribution against genomic features, user can plot distributions over different feature levels each containing multiple categories with function genomicElementDistribution.
geneLevel
Exons
ExonIntron
Please note that peaks can be classified into multiple categories for different levels, leading to the number of annotated features greater than the total number of peaks. At each level, since peak spans a genomic range, it may overlap with multiple categories of features. In such cases, by default (nucleotideLevel = FALSE), the precedence is determined by the order listed in the labels argument. Otherwise, if nucleotideLevel = TRUE, the summit of the peak (single position) will be used when determining the overlapping features. Type ?genomicElementDistribution to learn more.
The function genomicElementDistribution takes either one peak object or a list of peak objects as its input. For one peak set, a pie chart will be created whereas a bar graph will be created when a list of peak sets is provided.
genomicElementDistribution(macs_peak_gr1,
TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene)
Figure 5: Pie graph showing peak distributions over different genomic features
As can be seen, a good amount of peaks come from promoter regions confirming the signature of peaks obtained from transcription factor binding experiments.
macs_peaks <- GRangesList(rep1 = macs_peak_gr1,
rep2 = macs_peak_gr2)
genomicElementDistribution(macs_peaks,
TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene)
Figure 6: Bar graph showing peak distributions over different genomic features
The distribution patterns from rep1 and rep2 indicate a high correlation between them.
We can create an UpSet plot to view the overlapping of peaks across multiple genomic features.
library(UpSetR)
res <- genomicElementUpSetR(macs_peak_gr1,
TxDb.Hsapiens.UCSC.hg38.knownGene)
upset(res[["plotData"]],
nsets = length(colnames(res$plotData)),
nintersects = NA)
Figure 7: UpSet plot showing the overlappings of peak distributions among different genomic features
UpSet plot can be regarded as a “high dimensional Venn diagram” that supports viewing overlappings of multiple sets. For example, in the above plot, feature set “gene body” (from “Gene Level”) and “intron” (from “Exon/Intron/Intergenic”) share the highest number of common peaks.
With the annotation data, you can assign the peaks identified in your experiments to nearby features of your choice like genes, transcripts, etc. This process is known as peak annotation. The function annotatePeakInBatch provides extremely flexible ways to perform peak annotation with the output option. For example, peaks can be annotated by the nearest (output = "nearestLocation") or overlapping (output = "overlapping") features (peak-centric method). Alternatively, peaks can be annotated by their relative locations to features (feature-centric method), for example, if a peak is located upstream of a gene within a certain distance (e.g. promoter region), you can assign that gene to the peak (output = "upstream"). The relative locations can be specified with even more flexibility when using the bindingRegion option. Detailed explanations see below.
Whether to use the peak-centric or feature-centric method depends on your research question though most users start with the peak-centric approach as it is easier to interpret.
You can assign the nearest or overlapping features to your peaks by setting the output option to the followings.
output: criteria for feature output
nearestLocation: (default) output the features that are nearest to the peaks as calculated by the absolute value of PeakLocForDistance - FeatureLocForDistaneoverlapping: output all features that are overlapped with the peaks within the maximum distance as specified with the maxgap parameterboth: output the nearest features plus any overlapping features that are not the nearestshortestDistance: output the features that are with the shortest distance to the peaks; the “shortest distance” is determined from either ends of the feature to either ends of the peakOther relevant parameters.
PeakLocForDistance: location of the peak for distance calculation
middle: (recommended) use the center of the peakstart: (default) use the 5’ end (relative to plus strand) of the peakend: use the 3’ end (relative to plus strand) of the peakendMinusStart: use the 3’ end (relative to plus strand) of the peak for features on plus strand and the 5’ end (relative to plus strand) of the peak for features on minus strandFeatureLocForDistance: location of the feature for distance calculation
middle: use the center of the featurestart: use the 5’ end (relative to plus strand) of the featureend: use the 3’ end (relative to plus strand) of the featureTSS: (default) use the 5’ end (relative to plus strand) for features on the plus strand and use the 3’ end (also relative to plus strand) for features on the minus strandgeneEnd: use the 3’ end (relative to plus strand) for features on the plus strand and use the 5’ end (also relative to plus strand) for features on the minus strandmaxgap: the maximum gap that is allowed between the peak and the feature ranges for them to be considered as overlapping; it is defined as the number of positions that separates the two ranges, the gap between two adjacent ranges is 0output = "nearestLocation", the peak-to-feature distance is calculated as abs(PeakLocForDistance - FeatureLocForDistance); for demonstration, PeakLocForDistance = "start" and FeatureLocForDistance = "TSS".
Figure 8: How does peak-centric annotation method work
You can also annotate peaks according to their relative locations to the nearby features. For example, by setting output = "upstream", you will annotate peaks to those features that the peaks reside upstream of.
You can use the following options to specify the desired relative locations of the peaks to the features.
output: criteria for feature output
upstream: output the feature if the peak resides upstream of itupstrem&inside: output the feature if the peak reside upstream of it or inside itinside: output the feature if the peak is wholly contained inside itinside&downstream: output the feature if the peak resides inside it or downstream of itdownstream: output the feature if the peak resides downstream of itupstreamORdownstream: output the feature if the peak resides upstream or downstream of itOther relevant parameter.
maxgap: allowed gap when determining overlap between two ranges; will be ignored if output == "inside"
Figure 9: How does feature-centric annotation method work
When using the feature-centric method, annotatePeakInBatch will output the feature as long as the peak range overlaps with the target region as shown above except when output = "inside" where the entire peak range must be wholly contained inside the feature range.
bindingRegionThe bindingRegion parameter (the term simply refers to the regions that the target TF binds to) makes the feature-centric method even more flexible when defining the target region. For example, if you would like to annotate peaks to those features where the peaks reside at specific regions relative to them, e.g. from 5K upstream to 3K downstream of TSS (where most promoters locate), you can set output = "overlapping", FeatureLocForDistance = "TSS", and bindingRegion = c(-5000, 3000). Once bindingRegion is specified, the maxgap will be ignored.
bindingRegion to define target regions. Use ?annotatePeakInBatch for more examples.
Figure 10: How to define target region with bindingRegion
Bi-directional promoters refer to the genomic regions that are located upstream of TSS of two adjacent genes that are transcribed in opposite directions. Those promoters typically regulate the expression of both genes. To annotate peaks that are located in such regions, you can use output = "overlapping", FeatureLocForDistance = "TSS", plus bindingRegion. By default, select = "all" meaning that all overlapping features will be output. If you want to annotate peaks to features with the nearest bi-directional promoters, you can use output = "nearestBiDirectionalPromoters" plus bindingRegion. In this setting, at most one feature will be reported from each direction. Both maxgap and FeatureLocForDistance will be ignored when using output = nearestBiDirectionalPromoters.
The following example uses myPeakList that comes with ChIPpeakAnno. As different major genome releases (e.g. hg19 vs hg38) might differ in terms of the feature coordinates, it is recommended to use an annotation file that is created with the matching genome used when generating your peak file. For myPeakList, the peaks were originally called against hg18, therefore, we need to use a matching annotation file created with hg18. Use ?myPeakList to learn more about the example peak file. Alternatively, you can liftover your myPeakList to hg38 coordinates with the rtracklayer::liftOver() function, check out “Step3” from Section @ref(custom_pool).
Users can annotate the peaks by assigning the nearby features to them. This can be achieved with the output = "nearestLocation" option. Results from this option may contain “overlapping” features as long as they are the nearest to the peaks.
library(TxDb.Hsapiens.UCSC.hg18.knownGene)
data(myPeakList)
peak_to_anno <- myPeakList[1:100]
anno_data <- transcripts(TxDb.Hsapiens.UCSC.hg18.knownGene)
annotated_peak <- annotatePeakInBatch(peak_to_anno,
output = "nearestLocation",
PeakLocForDistance = "start",
FeatureLocForDistance = "TSS",
AnnotationData = anno_data)
head(annotated_peak, n = 2)
## GRanges object with 2 ranges and 9 metadata columns:
## seqnames ranges strand | peak feature
## <Rle> <IRanges> <Rle> | <character> <character>
## X1_93_556427.ann16 chr1 556660-556760 * | X1_93_556427 ann16
## X1_41_559455.ann19 chr1 559774-559874 * | X1_41_559455 ann19
## start_position end_position feature_strand insideFeature
## <integer> <integer> <character> <character>
## X1_93_556427.ann16 556325 557910 + inside
## X1_41_559455.ann19 559396 560212 + inside
## distancetoFeature shortestDistance
## <numeric> <integer>
## X1_93_556427.ann16 335 335
## X1_41_559455.ann19 378 338
## fromOverlappingOrNearest
## <character>
## X1_93_556427.ann16 NearestLocation
## X1_41_559455.ann19 NearestLocation
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Break down of the options:
output = "nearestLocation": (default)
abs(PeakLocForDistance - FeatureLocForDistance). The output can consist of both “strictly nearest features (non-overlapping)” and “overlapping features” as long as it is the nearestPeakLocForDistance = "start" (default):
"start" means using the 5’ end (relative to plus strand) of the peak to calculate the distance to featuresFeatureLocForDistance = "TSS" (default):
"TSS" means using the 5’ end (relative to plus strand) for features on the plus strand and use the 3’ end (also relative to plus strand) for features on the minus strand to calculate the distance to featuresBreak down of the resulting metadata columns:
peak: id of the peakfeature: id of the feature such as ensembl gene IDstart_position, end_position, feature_strand: feature coordinatesinsideFeature: relative location of the peak to the feature
upstream: peak resides upstream of the featuredownstream: peak resides downstream of the featureinside: peak resides inside the featureoverlapStart: peak overlaps with the start of the featureoverlapEnd: peak overlaps with the end of the featureincludeFeature: peak includes the feature entirelydistancetoFeature: peak-to-feature distance as calculated with abs(PeakLocForDistance - FeatureLocForDistance)shortestDistance: the shortest distance from either end of the peak to either end the featurefromOverlappingOrNearest: relevant only when output is set to "both"; “nearestLocation” indicates that the feature is the closest to the peak, can overlap with the peak; “Overlapping” means that the feature overlaps with the peak and it is not the nearest; when output is set to "nearestLocation", this column should consist of only “NearestLocation”In addition, annotatePeakInBatch can also report both the nearest features as well as overlapping features by setting output = "both" and the maxgap parameter. For example, the following command outputs the nearest features plus all overlapping features that are within 5kb away.
annotated_peak <- annotatePeakInBatch(peak_to_anno,
AnnotationData = anno_data,
output = "both",
maxgap = 5000)
head(annotated_peak, n = 4)
## GRanges object with 4 ranges and 9 metadata columns:
## seqnames ranges strand | peak feature
## <Rle> <IRanges> <Rle> | <character> <character>
## X1_93_556427.ann15 chr1 556660-556760 * | X1_93_556427 ann15
## X1_93_556427.ann16 chr1 556660-556760 * | X1_93_556427 ann16
## X1_93_556427.ann17 chr1 556660-556760 * | X1_93_556427 ann17
## X1_93_556427.ann18 chr1 556660-556760 * | X1_93_556427 ann18
## start_position end_position feature_strand insideFeature
## <integer> <integer> <character> <character>
## X1_93_556427.ann15 554902 555924 + downstream
## X1_93_556427.ann16 556325 557910 + inside
## X1_93_556427.ann17 558012 558705 + upstream
## X1_93_556427.ann18 558707 558776 + upstream
## distancetoFeature shortestDistance
## <numeric> <integer>
## X1_93_556427.ann15 1758 736
## X1_93_556427.ann16 335 335
## X1_93_556427.ann17 -1352 1252
## X1_93_556427.ann18 -2047 1947
## fromOverlappingOrNearest
## <character>
## X1_93_556427.ann15 Overlapping
## X1_93_556427.ann16 NearestLocation
## X1_93_556427.ann17 Overlapping
## X1_93_556427.ann18 Overlapping
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Now, the fromOverlappingOrNearest column consists of both “NearestLocation” and “Overlapping” categories.
The relative location distribution of peak-to-feature can be visualized with the information stored in the insideFeature metadata column.
pie1(table(annotated_peak$insideFeature))
Figure 11: Peak distribution around features
You can also create and pass user defined feature coordinates in GRanges format as annotationData. For example, you may come across a list of transcript factor (TF) binding sites by literature mining and would like to use it to annotate your peaks. To do so, you first need to convert the TF binding site coordinates into a GRanges object, then pass that object into annotatePeakInBatch.
TF_binding_sites <- GRanges(seqnames = c("1", "2", "3", "4", "5", "6", "1",
"2", "3", "4", "5", "6", "6", "6",
"6", "6", "5"),
ranges = IRanges(start = c(967659, 2010898, 2496700,
3075866, 3123260, 3857500,
96765, 201089, 249670,
307586, 312326, 385750,
1549800, 1554400, 1565000,
1569400, 167888600),
end = c(967869, 2011108, 2496920,
3076166,3123470, 3857780,
96985, 201299, 249890, 307796,
312586, 385960, 1550599,
1560799, 1565399, 1571199,
167888999),
names = paste("t", 1:17, sep = "")),
strand = c("+", "+", "+", "+", "+", "+", "-", "-", "-",
"-", "-", "-", "+", "+", "+", "+", "+"))
annotated_peak2 <- annotatePeakInBatch(peaks1, AnnotationData = TF_binding_sites)
head(annotated_peak2, n = 2)
## GRanges object with 2 ranges and 9 metadata columns:
## seqnames ranges strand | peak feature
## <Rle> <IRanges> <Rle> | <character> <character>
## Site1.t1 1 967654-967754 + | Site1 t1
## Site2.t2 2 2010897-2010997 + | Site2 t2
## start_position end_position feature_strand insideFeature
## <integer> <integer> <character> <character>
## Site1.t1 967659 967869 + overlapStart
## Site2.t2 2010898 2011108 + overlapStart
## distancetoFeature shortestDistance fromOverlappingOrNearest
## <numeric> <integer> <character>
## Site1.t1 -5 5 NearestLocation
## Site2.t2 -1 1 NearestLocation
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
Another example of using user defined AnnotationData is to annotate peaks by promoters. Promoter is typically defined as the DNA sequence located immediately upstream of the transcription start site (TSS) of a gene. The specific size of a promoter can vary depending on the gene, its regulatory complexity, and the species being studied. In practice, the promoter region can be defined as 1000bp upstream and 100bp downstream from the TSS. To prepare a custom annotation file containing only promoters, users can levarage the accessor function promoters.
promoter_regions <- promoters(TxDb.Hsapiens.UCSC.hg18.knownGene,
upstream = 1000, downstream = 100)
head(promoter_regions, n = 2)
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## uc001aaa.2 chr1 116-1215 + | 1 uc001aaa.2
## uc009vip.1 chr1 116-1215 + | 2 uc009vip.1
## -------
## seqinfo: 49 sequences (1 circular) from hg18 genome
annotated_peak3 <- annotatePeakInBatch(peak_to_anno,
AnnotationData = promoter_regions)
head(annotated_peak3, n = 2)
## GRanges object with 2 ranges and 9 metadata columns:
## seqnames ranges strand | peak
## <Rle> <IRanges> <Rle> | <character>
## X1_93_556427.uc001aba.1 chr1 556660-556760 * | X1_93_556427
## X1_41_559455.uc001abc.1 chr1 559774-559874 * | X1_41_559455
## feature start_position end_position
## <character> <integer> <integer>
## X1_93_556427.uc001aba.1 uc001aba.1 557012 558111
## X1_41_559455.uc001abc.1 uc001abc.1 558396 559495
## feature_strand insideFeature distancetoFeature
## <character> <character> <numeric>
## X1_93_556427.uc001aba.1 + upstream -352
## X1_41_559455.uc001abc.1 + downstream 1378
## shortestDistance fromOverlappingOrNearest
## <integer> <character>
## X1_93_556427.uc001aba.1 252 NearestLocation
## X1_41_559455.uc001abc.1 279 NearestLocation
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Depending on the annotation file you adopted, the feature being assigned to your peaks may have different feature ID. For example, if you are annotating your peaks with genes(TxDb.Hsapiens.UCSC.hg38.knownGene), the feature id that comes with your annotation file is “entrez ID”; if you are annotating your peaks with genes(EnsDb.Hsapiens.v86), the feature id that comes with your annotation is “ensembl gene ID”.
anno_txdb <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
head(anno_txdb$gene_id, n = 5)
## [1] "1" "10" "100" "1000" "100008586"
anno_ensdb <- genes(EnsDb.Hsapiens.v86)
head(anno_ensdb$gene_id, n = 5)
## [1] "ENSG00000223972" "ENSG00000227232" "ENSG00000278267" "ENSG00000243485"
## [5] "ENSG00000237613"
The feature id in your annotation file will be listed as the “feature” metadata column in your annotated peak GRanges object. You can find and link them to other ids such as “symbol” using addGeneIDs functions.
The function addGeneIDs can take either a vector of feature ids or an annotated peak GRanges object as input. It works by creating a map between input feature id and ids being linked to by leveraging either an organism annotation dataset (OrgDb object such as org.Hs.eg.db) or a BioMart dataset (Mart object such as useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")). Therefore, you need to provide the input feature id type via feature_id_type and the feature id types that need to be linked to via IDs2Add argument.
The supported feature_id_type and IDs2Add differ between OrgDb and Mart. Below summarizes the commonly used ids. Use ?addGeneIDs to see full list of supported ids.
OrgDb
feature_id_type: id type for input feature
IDs2Add: id types for features to be linked to
Mart
feature_id_type: id type for input feature, use listFilters(mart) to see full supported list
IDs2Add: id types for features to be linked to, use listAttributes(mart) to see full supported list
The following example demonstrates on using addGeneIDs function to find gene symbols for a vector of entrez IDs with OrgDb. Of note that the "org.Hs.eg.db" must be quoted.
library(org.Hs.eg.db)
entrez_ids <- head(ucsc.hg38.knownGene$gene_id, n = 10)
print(entrez_ids)
## [1] "1" "10" "100" "1000" "100008586" "100008587"
## [7] "100009613" "100009667" "100009676" "10001"
res <- addGeneIDs(entrez_ids,
orgAnn = "org.Hs.eg.db",
feature_id_type = "entrez_id",
IDs2Add = "symbol")
head(res, n = 3)
## entrez_id symbol
## 1 1 A1BG
## 2 10 NAT2
## 3 100 ADA
For this example, we annotate the macs_peak_gr2 (obtained in Section 3) using transcript information from TxDb, and add gene symbols to them.
txdb.hg38.transcript <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
head(txdb.hg38.transcript, n = 4)
## GRanges object with 4 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 11869-14409 + | 1 ENST00000456328.2
## [2] chr1 12010-13670 + | 2 ENST00000450305.2
## [3] chr1 29554-31097 + | 3 ENST00000473358.1
## [4] chr1 30267-31109 + | 4 ENST00000469289.1
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
head(names(txdb.hg38.transcript), n = 4)
## NULL
Looks like txdb.hg38.transcript has a tx_name metadata column that contains the Ensembl transcript ids, however, annotatePeakInBatch requires this piece of information to be stored as names(txdb.hg38.transcript) so as to generate annotated peaks that is compatible with addGeneIDs. Therefore, we need to extract transcript ids and assign them as names first.
tr_id <- txdb.hg38.transcript$tx_name
tr_id <- sub("\\..*$", "", tr_id) # get rid of the trailing version number
names(txdb.hg38.transcript) <- tr_id
head(txdb.hg38.transcript, n = 4)
## GRanges object with 4 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## ENST00000456328 chr1 11869-14409 + | 1 ENST00000456328.2
## ENST00000450305 chr1 12010-13670 + | 2 ENST00000450305.2
## ENST00000473358 chr1 29554-31097 + | 3 ENST00000473358.1
## ENST00000469289 chr1 30267-31109 + | 4 ENST00000469289.1
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
Now, we can annotate macs_peak_gr2 with annotatePeakInBatch.
res <- annotatePeakInBatch(macs_peak_gr2,
AnnotationData = txdb.hg38.transcript)
head(res, n = 2)
## GRanges object with 2 ranges and 10 metadata columns:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## MACS_peak_1.ENST00000473358 chr1 28341-29610 * | 1
## MACS_peak_2.ENST00000495576 chr1 90821-91234 * | 1
## peak feature start_position
## <character> <character> <integer>
## MACS_peak_1.ENST00000473358 MACS_peak_1 ENST00000473358 29554
## MACS_peak_2.ENST00000495576 MACS_peak_2 ENST00000495576 89551
## end_position feature_strand insideFeature
## <integer> <character> <character>
## MACS_peak_1.ENST00000473358 31097 + overlapStart
## MACS_peak_2.ENST00000495576 91105 - overlapStart
## distancetoFeature shortestDistance
## <numeric> <integer>
## MACS_peak_1.ENST00000473358 -1213 56
## MACS_peak_2.ENST00000495576 284 129
## fromOverlappingOrNearest
## <character>
## MACS_peak_1.ENST00000473358 NearestLocation
## MACS_peak_2.ENST00000495576 NearestLocation
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The resultingfeature column has ensembl transcript ids as expected. Next, we use the Mart option to add gene symbols.
library(biomaRt)
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl")
res <- addGeneIDs(res,
mart = mart,
feature_id_type = "ensembl_transcript_id",
IDs2Add = "hgnc_symbol")
head(res, n = 2)
## GRanges object with 2 ranges and 11 metadata columns:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## MACS_peak_1.ENST00000473358 chr1 28341-29610 * | 1
## MACS_peak_2.ENST00000495576 chr1 90821-91234 * | 1
## peak feature start_position
## <character> <character> <integer>
## MACS_peak_1.ENST00000473358 MACS_peak_1 ENST00000473358 29554
## MACS_peak_2.ENST00000495576 MACS_peak_2 ENST00000495576 89551
## end_position feature_strand insideFeature
## <integer> <character> <character>
## MACS_peak_1.ENST00000473358 31097 + overlapStart
## MACS_peak_2.ENST00000495576 91105 - overlapStart
## distancetoFeature shortestDistance
## <numeric> <integer>
## MACS_peak_1.ENST00000473358 -1213 56
## MACS_peak_2.ENST00000495576 284 129
## fromOverlappingOrNearest hgnc_symbol
## <character> <character>
## MACS_peak_1.ENST00000473358 NearestLocation MIR1302-2HG
## MACS_peak_2.ENST00000495576 NearestLocation
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Use ?listMarts to show available biomart and use ?listDatasets to show available dataset. Be aware that, unlike using OrgDb option, we must supply hgnc_symbol instead of symbol for the IDs2Add argument.
Performing enrichment analysis is a crucial step that helps to determine whether certain biological processes, pathways, or functional categories are over-represented among the genes associated with ChIP-seq peaks. In other words, it provides functional implications to the annotated peaks by annotatePeakInBatch. Gene ontology (GO) and pathway enrichment are two commonly practiced methods for enrichment analysis. They provide insights into gene functions at different levels.
The Gene Ontology is a structured vocabulary that categorizes genes and their products into three main categories, namely, Molecular Function (what it does), Biological Process (why it does it), and Cellular Component (where it does it). A pathway, on the other hand, refers to a set of predefined genes that are involved in a coordinated sequence of molecular events or cellular processes that collectively perform a specific biological function. Examples of biological pathways include “Glycolysis pathway”, which is a metabolic pathway that breaks down glucose into pyruvate; and “MAPK signaling pathway”, which is a signal transduction pathway involved in cell proliferation, differentiation, and response to external stimuli. While GO enrichment analysis provides more general insights, pathway analysis focuses specifically on predefined pathways. Both methods are commonly practiced and can be achieved with the getEnrichedGO and getEnrichedPATH functions provided with ChIPpeakAnno.
In the following demonstration, we extract a subset peak (first 200) from macs_peak_gr2 (obtained in Section 2.1), annotate them with genes from TxDb and perform GO and pathway enrichment analysis. In real situation, full set should be used. To visualize the enriched terms, use enrichmentPlot function.
anno_data <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
annotated_peak4 <- annotatePeakInBatch(macs_peak_gr2[1:200],
AnnotationData = anno_data,
output = "both",
maxgap = 5000)
enriched_go <- getEnrichedGO(annotated_peak4,
orgAnn = "org.Hs.eg.db",
feature_id_type = "entrez_id",
multiAdjMethod = "BH")
enrichmentPlot(enriched_go)
## [1] category
## <0 rows> (or 0-length row.names)
Be aware that the "org.Hs.eg.db" must be quoted, and the feature_id_type must match the id type stored in the feature metadata column of your annotated peak object. If you are using genes from TxDb for peak annotation, the feature is likely entrez_id. When using genes from EnsDb for annotation, the feature should be ensembl_gene_id. Below is to show the number of enriched GO terms in each category, “bp” is for “biological process”, “cc” is for “cellular component”, and “mf” is for “molecular function”.
length(enriched_go[["bp"]][["go.id"]])
## [1] 0
length(enriched_go[["cc"]][["go.id"]])
## [1] 0
length(enriched_go[["mf"]][["go.id"]])
## [1] 0
Using the subset of annotated macs_peak_gr2 leads to zero enriched GO term under the “bp” and “cc” categories, and 6 hits under the “mf” category.
head(enriched_go[["mf"]], n = 2)
## [1] go.id go.term Definition
## [4] Ontology pvalue count.InDataset
## [7] count.InGenome totaltermInDataset totaltermInGenome
## [10] BH.adjusted.p.value EntrezID
## <0 rows> (or 0-length row.names)
For pathway analysis, there are at least two popular databases, namely, Reactome and KEGG (Kyoto Encyclopedia of Genes and Genomes). While Reactome is renowned for its detailed and expert-curated information, KEGG provides a broader scope of information including pathways, diseases, drugs, and more organisms. The function getEnrichedPATH can use either database by specifying the pathAnn parameter. The built-in dataset annotatedPeaks will be used for demonstration. Be aware that feature_id_type = "ensembl_gene_id" for this dataset. Use ?annotatePekas to learn more.
library(reactome.db)
data(annotatedPeak)
enriched_path <- getEnrichedPATH(annotatedPeak,
orgAnn = "org.Hs.eg.db",
feature_id_type = "ensembl_gene_id",
pathAnn = "reactome.db")
To use KEGG database, simply set pathAnn = "KEGGREST".
enrichmentPlot(enriched_path)
Figure 12: Histogram showing enriched pathways
The function getEnrichedGO and getEnrichedPATH require an OrgDb annotation package to work, for less common species that are without a valid OrgDb, users can refer to this post for alternative methods.
A sequence motif is a recurring pattern in DNA that is thought to have a biological function. They usually indicate binding sites for proteins including transcription factors (TF), others play a role at the RNA level such as ribosome binding and transcription termination. For peaks associated with epigenetic markers, motif analysis helps to identify candidate transcription factors. For peaks obtained through experiments such as transcription factor ChIP-seq, motif analysis facilitates the validation of expected binding factors, meanwhile, unanticipated motifs suggest cofactors.
ChIPpeakAnno provides several functions that are related to motif analysis, details see below.
+ getAllPeakSequence: obtain genomic sequences around peaks
+ Obtained sequences can be used for motif discovery or PCR validation
+ oligoSummary: find consensus sequences (motifs) in peak sequences
+ summarizePatternInPeaks: check if given motifs appear in peak sequences
Here is an example to retrieve the peak sequences plus 20bp upstream and 20bp downstream for macs_peak_gr2 peaks obtained in Section 3.
library(BSgenome.Hsapiens.UCSC.hg38)
sequence_around_peak <- getAllPeakSequence(macs_peak_gr2,
upstream = 20,
downstream = 20,
genome = BSgenome.Hsapiens.UCSC.hg38)
head(sequence_around_peak, n = 2)
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | score upstream downstream
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## MACS_peak_1 chr1 28341-29610 * | 1 20 20
## MACS_peak_2 chr1 90821-91234 * | 1 20 20
## sequence
## <character>
## MACS_peak_1 CTGTAGTTGCTCATCTGAAG..
## MACS_peak_2 GCTGCTGCTGTCTGTAGCTG..
## -------
## seqinfo: 1 sequence from an unspecified genome
The genome argument accepts either a BSgenome object or a Mart object. Again, the genome version must match the one used for creating the peak file. For full list of BSgenome, check out this site. The following example demonstrates on how to use Mart. Refer to Section 4.5 for more on Mart. Be aware that this option is slower than using the BSgenome as it will query the BioMart for annotations on the fly if AnnotationData is not set.
library(biomaRt)
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl")
sequence_around_peak <- getAllPeakSequence(macs_peak_gr2[1],
upstream = 20,
downstream = 20,
genome = mart)
To save the sequences into fastq, use function write2FASTA.
write2FASTA(sequence_around_peak, file = "macs_peak_gr2.fa")
The function oligoSummary adopts Markov models to determine whether a motif is enriched in a set of sequences compared to background. Therefore, we first need to estimate the frequencies for all combinations of short oligonucleotides at given length in the background (Leung, Marsh, and Speed 1996). This can be achieved with oligoFrequency. In the following example, we attempt to discover consensus sequences of length 6.
freqs <- oligoFrequency(BSgenome.Hsapiens.UCSC.hg38$chr1)
motif_summary <- oligoSummary(sequence_around_peak,
oligoLength = 6,
MarkovOrder = 3,
freqs = freqs,
quickMotif = TRUE)
Break down of the arguments:
oligoLength: search for motifs with this lengthMarkovOrder: the order of a Markov chain that determines the extent to which the current state in the chain is dependent on previous states; simply put, a zero-order Markov chain does not consider any context or dependencies between adjacent elements while higher-order Markov chains capture longer-range dependencies in sequences and are better suited for modeling more complex sequence patternsfreqs: background motif frequencies used for Markov modelquickMotif: generate the motif matrices or notThe resulting motif_summary is a list containing three elements:
zscore: the Z-score of each oligonucleotide quantifies how many standard deviations the observed count is from the expected count; a high positive score suggests that the motif is significantly over-represented in the peakcounts: the count numbe of each oligonucleotidemotifs: a list of motif matrices when quickMotif = TRUEWe can use histogram (hist) to visualize the resulting Z-scores and add labels with the text function. Below, we label the name of the motif that has the highest Z-score.
zscore <- sort(motif_summary$zscore)
h <- hist(zscore, breaks = 100, main = "Histogram of Z-score")
text(x = zscore[length(zscore)],
y = h$counts[length(h$counts)] + 1,
labels = names(zscore[length(zscore)]),
adj = 0,
srt = 90)
Figure 13: Histogram showing Z-score distribution
For comparison, the following snippet plots the Z-score distribution with simulation data. Here, we first simulate 5000 sequences with lengths ranging from 100 to 2000 nucleotides. We then randomly insert motif 1 and motif 2 into 10%, and 15% of the reads respectively.
set.seed(1)
# motifs to simulate
simulate_motif_1 <- "AATTTA"
simulate_motif_2 <- "TGCATG"
# sample 5000 sequences with lengths ranging from 100 to 2000 nucleotides
# randomly insert motif_1 to 10% of the sequences, and motif_2 to 15% of the sequences
simulation_seqs <- sapply(sample(c(1, 2, 0),
5000,
prob = c(0.1, 0.15, 0.75),
replace = TRUE),
function(x) {
seq <- sample(c("A", "T", "C", "G"),
sample.int(1900, 1) + 100,
replace = TRUE)
insert_pos <- sample.int(length(seq) - 6, 1)
if (x == 1) {
seq[insert_pos:(insert_pos + 5)] <- strsplit(simulate_motif_1, "")[[1]]
} else if (x == 2) {
seq[insert_pos:(insert_pos + 5)] <- strsplit(simulate_motif_2, "")[[1]]
}
paste(seq, collapse = "")
}
)
motif_summary_simu <- oligoSummary(simulation_seqs,
oligoLength = 6,
MarkovOrder = 3,
quickMotif = TRUE)
zscore_simu <- sort(motif_summary_simu$zscore,
decreasing = TRUE)
h_simu <- hist(x = zscore,
breaks = 100,
main = "Histogram of Z-score for simulation data")
text(x = zscore_simu[1:2],
y = rep(5, 2),
labels = names(zscore_simu[1:2]),
adj = 0,
srt = 90)
Figure 14: Histogram showing Z-score distribution for simulation data
As can be seen from the simulation results, the higher the probability of the motif, the larger the resulting Z-score.
You can also visualize the motif with motifStack package.
library(motifStack)
pfm <- new("pfm", mat = motif_summary$motifs[[1]],
name = "sample motif 1")
motifStack(pfm)
To loop through each element in motif_summary$motifs, you can use the function mapply.
pfms <- mapply(function(motif, id) { new("pfm", mat = motif, name = as.character(id)) },
motif_summary$motifs,
1:length(motif_summary$motifs))
motifStack(pfms[[1]])
Figure 15: Plot showing the first motif detected
If you have a list of motifs (sequence patterns), you can use function summarizePatternInPeaks to see if they appear in the peaks or not.
example_pattern_file <- system.file("extdata/examplePattern.fa",
package = "ChIPpeakAnno")
readLines(example_pattern_file)
## [1] ">ExamplePattern" "GGNCCK" ">ExamplePattern" "AACCNM"
pattern_in_peak <- summarizePatternInPeaks(patternFilePath = example_pattern_file,
BSgenomeName = BSgenome.Hsapiens.UCSC.hg38,
peaks = macs_peak_gr2[1:200])
head(pattern_in_peak, n = 2)
## chr motifStart motifEnd motif name motif motifStartOffset motifEndOffset
## 1 1 28746 28751 ExamplePattern GGNCCK 406 411
## 2 1 29049 29054 ExamplePattern GGNCCK 709 714
## motif found motifFoundStrand seqnames start end width strand score
## 1 GGACCG + chr1 28341 29610 1270 * 1
## 2 GGGCCG + chr1 28341 29610 1270 * 1
In addition to ChIPpeakAnno, users can extract the sequences with getAllPeakSequence and leverage other tools like MEME Suite for motif-based sequence analysis.
Given two or more peak sets from different experiments (e.g. two TFs), it might be interesting to see if the peak profiles are correlated, and if so, how does the peak patterns compare to each other. ChIPpeakAnno not only provides functions to test if there is a significant overlap among multiple peak sets, but also offer visualization function for easy comparison of peak patterns.
The significance of overlap can be determined with hypergeometric test or permutation test, both are available in ChIPpeakAnno.
Hypergeometric test is based on the concept of hypergeometric distribution, which is a probability distribution that describes the chance of drawing a certain number of successes (k) from a sample of size n, given a finite population (N) that contains K successes when sampling without replacement. The null hypothesis is that the sample is drawn randomly from the population, which means that the two peak sets do not overlap. A small P-value indicates that the null should be rejected and that the two peak sets overlap significantly.
In ChIPpeakAnno, the hypergeometric test is implemented in the makeVennDiagram function, also see Section 3.2.2. The example below demonstrates on how to figure out the hypergeometric test P-values for peak sets from three TF ChIP-seq experiments.
tf1 <- toGRanges(system.file("extdata/TAF.broadPeak", package = "ChIPpeakAnno"),
format = "broadPeak")
tf2 <- toGRanges(system.file("extdata/Tead4.broadPeak", package = "ChIPpeakAnno"),
format = "broadPeak")
tf3 <- toGRanges(system.file("extdata/YY1.broadPeak", package = "ChIPpeakAnno"),
format = "broadPeak")
For hypergeometric test to work, we first need to estimate the total number of binding sites, i.e. totalTest. The choice of totalTest affects the stringency of the test, with smaller values leading to more conservative outcome (larger P-values). For practical guidance on how to choose an appropriate value for totalTest, you can refer to the post.
For our example, we assume that potential binding regions (coding regions + promoter regions) take up 3% of the entire genome. Since the example data is from chromosome 2, we can estimate the number of total binding sites as (length of chr24) * 3% / (mean peak length).
overlapping_peaks <- findOverlapsOfPeaks(tf1,
tf2,
tf3,
connectedPeaks = "keepAll")
mean_peak_width <- mean(width(unlist(GRangesList(overlapping_peaks[["all.peaks"]]))))
total_binding_sites <- length(BSgenome.Hsapiens.UCSC.hg38[["chr2"]]) * 0.03 / mean_peak_width
venn1 <- makeVennDiagram(overlapping_peaks,
totalTest = total_binding_sites,
connectedPeaks = "keepAll",
fill = c("#CC79A7", "#56B4E9", "#F0E442"),
col = c("#D55E00", "#0072B2", "#E69F00"),
cat.col = c("#D55E00", "#0072B2", "#E69F00"))
Figure 16: Venn diagram1 showing overlapping peaks
For P-values of each peak pair:
venn1[["p.value"]]
## tf1 tf2 tf3 pval
## [1,] 0 1 1 4.565072e-52
## [2,] 1 0 1 0.000000e+00
## [3,] 1 1 0 2.744460e-88
For overlapping peak counts:
venn1[["vennCounts"]]
## tf1 tf2 tf3 Counts count.tf1 count.tf2 count.tf3
## [1,] 0 0 0 4953.594 0 0 0
## [2,] 0 0 1 621.000 0 0 621
## [3,] 0 1 0 2097.000 0 2097 0
## [4,] 0 1 1 309.000 0 310 319
## [5,] 1 0 0 59.000 59 0 0
## [6,] 1 0 1 166.000 172 0 170
## [7,] 1 1 0 8.000 8 8 0
## [8,] 1 1 1 476.000 545 537 521
## attr(,"class")
## [1] "VennCounts"
The connectedPeaks = "keepAll" means that when multiple peaks from different groups involve in overlapping, all original peak counts for each group will be displayed in parenthesis while the minimally involved peak count will be shown outside the parenthesis. When connectedPeaks = "keepFirstListConsistent", the counts from the first group will be consistently kept.
venn2 <- makeVennDiagram(overlapping_peaks,
totalTest = total_binding_sites,
connectedPeaks = "keepFirstListConsistent",
fill = c("#CC79A7", "#56B4E9", "#F0E442"),
col = c("#D55E00", "#0072B2", "#E69F00"),
cat.col = c("#D55E00", "#0072B2", "#E69F00"))
Figure 17: Venn diagram2 showing overlapping peaks
Since the P-values are all very small, the null hypothesis must be rejected, and that each pair of the peak sets overlap significantly with each other. A major drawback of this approach is that we must estimate a totalTest, which could dramatically affect the test outcome. For instance, if we choose “2%” over “3%” in the above example, the P-value for tf1 vs. tf2 becomes 0.49 so that we can no longer reject the null. To circumvent the requirement of totalTest, we incorporated permutation test into peakPermTest function.
The permutation test is a non-parametric test meaning that it does not require data to follow any specific distribution. The test statistic is determined according to the observed data, and the null distribution of the test statistic is estimated using a permutation (re-sampling) procedure.
In our case, the number of overlapping peaks is considered as the test statistic, whose null distribution is estimated by first re-sampling peaks from a random peak list (the peak pool that represent all potential binding sites) followed by counting the number of overlapping peaks. The random peak list is generated using the distributions discovered from the input peaks, to ensure that the peak widths and relative binding positions to the features (such as TSS and geneEnd), follow the same distributions as the input peaks. When null hypothesis holds, the number of overlapping peaks are not significantly different from by chance.
The function peakPermTest can generate peak pool automatically given the peak binding type (bindingType = c("TSS", "geneEnd")), annotation type (featureType = c("transcript", "exon")), and annotation data (Txdb). Below are sample codes.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
set.seed(123)
# Example1: non-relevant peak sets
utr5 <- unique(unlist(fiveUTRsByTranscript(txdb)))
utr3 <- unique(unlist(threeUTRsByTranscript(txdb)))
utr5 <- utr5[sample.int(length(utr5), 1000)]
utr3 <- utr3[sample.int(length(utr3), 1000)]
pt1 <- peakPermTest(peaks1 = utr3,
peaks2 = utr5,
TxDb = txdb,
maxgap = 500,
seed = 1)
plot(pt1)
Figure 18: Permutation test for non-relevant peak sets
# Example2: highly relevant peak sets
cds <- unique(unlist(cdsBy(txdb)))
ol <- findOverlaps(cds, utr3, maxgap = 1)
peaks2 <- c(cds[sample.int(length(cds), 500)],
cds[queryHits(ol)][sample.int(length(ol), 500)])
pt2 <- peakPermTest(peaks1 = utr3,
peaks2 = peaks2,
TxDb = txdb,
maxgap = 500,
seed = 1)
plot(pt2)
Figure 19: Permutation test for highly relevant peak sets
As can be seen, for highly relevant peak sets, the P-value is very small.
Instead of creating random peaks from input peaks, users can choose to build the peak pool from real binding sites obtained from experimental data with hot spots removed. Hot spots are genomic regions with high probability of being bound by many TFs in ChIP-seq experiments (Yip et al. 2012). We suggest to remove hot spots before permutation test to avoid over-estimating the association between two input peak sets. Users are encouraged to remove ENCODE blacklist regions as well. The blacklists were constructed by identifying consistently problematic regions over independent cell lines and types of experiment for each species in the ENCODE and modENCODE datasets (Consortium 2012).
Below is an example to create a peak pool for human genome using the transcription factor binding site clusters downloaded from ENCODE. The following steps serve as an example.
# Step1: download TF binding sites
temp <- tempfile()
download.file(file.path("https://hgdownload.cse.ucsc.edu/",
"goldenpath/",
"hg38/",
"encRegTfbsClustered/",
"encRegTfbsClusteredWithCells.hg38.bed.gz"),
temp)
df_tfbs <- read.delim(gzfile(temp, "r"), header = FALSE)
unlink(temp)
colnames(df_tfbs)[1:4] <- c("seqnames", "start", "end", "TF")
tfbs_hg38 <- GRanges(as.character(df_tfbs$seqnames),
IRanges(df_tfbs$start, df_tfbs$end),
TF = df_tfbs$TF)
# Step2: download hot spots
base_url <- "http://metatracks.encodenets.gersteinlab.org/metatracks/"
temp1 <- tempfile()
temp2 <- tempfile()
download.file(file.path(base_url,
"HOT_All_merged.tar.gz"),
temp1)
download.file(file.path(base_url,
"HOT_intergenic_All_merged.tar.gz"),
temp2)
untar(temp1, exdir = dirname(temp1))
untar(temp2, exdir = dirname(temp1))
bedfiles <- dir(dirname(temp1), "bed$")
hot_spots_hg19 <- sapply(file.path(dirname(temp1), bedfiles), toGRanges, format = "BED")
unlink(temp1)
unlink(temp2)
names(hot_spots_hg19) <- gsub("_merged.bed", "", bedfiles)
hot_spots_hg19 <- sapply(hot_spots_hg19, unname)
hot_spots_hg19 <- GRangesList(hot_spots_hg19)
# Step3: liftover hot spots to hg38
library(R.utils)
temp_chain_gz <- tempfile()
temp_chain <- tempfile()
base_url_chain <- "http://hgdownload.cse.ucsc.edu/goldenpath/"
download.file(file.path(base_url_chain,
"hg19/liftOver/",
"hg19ToHg38.over.chain.gz"),
temp_chain_gz)
gunzip(filename = temp_chain_gz, destname = temp_chain)
chain_file <- import.chain(hg19_to_hg38)
unlink(temp_chain_gz)
unlink(temp_chain)
hot_spots_hg38 <- liftOver(hot_spots_hg19, chain_file)
# Step4: select peak sets to test
tfbs_hg38_by_TF <- split(tfbs_hg38, tfbs_hg38$TF)
TAF1 <- tfbs_hg38_by_TF[["TAF1"]]
TEAD4 <- tfbs_hg38_by_TF[["TEAD4"]]
# Step5: remove hot spots from binding pool
remove_ol <- function(gr, gr_hot_spots) {
# helper function to remove overlaps with hot_spots
ol <- findOverlaps(gr, gr_hot_spots)
if (length(ol) > 0) gr <- gr[-unique(queryHits(ol))]
gr
}
tfbs_hg38 <- remove_ol(tfbs_hg38, hot_spots_hg38)
tfbs_hg38 <- reduce(tfbs_hg38)
# Step6: perform permutation test
pool <- new("permPool",
grs = GRangesList(tfbs_hg38),
N = length(TAF1))
pt3 <- peakPermTest(TAF1, TEAD4, pool = pool, ntimes = 500)
plot(pt3)
Figure 20: Permutation test using custom peak pool
If you have peak files obtained from multiple TF ChIP-seq experiments and would like to compare their binding patterns using raw signals such as read coverage. ChIPpeakAnno provides two functions, featureAlignedHeatmap and featureAlignedDistribution, to visualize their binding patterns side-by-side.
To demonstrate, we first need to prepare both the peak data and coverage information (bigWig). Four steps are involved.
# Step1: read in example peak files
extdata_path <- system.file("extdata", package = "ChIPpeakAnno")
broadPeaks <- dir(extdata_path, "broadPeak")
gr_TFs <- sapply(file.path(extdata_path, broadPeaks), toGRanges, format = "broadPeak")
names(gr_TFs) <- gsub(".broadPeak", "", broadPeaks)
names(gr_TFs)
## [1] "TAF" "Tead4" "YY1"
Now, we have read in peak files for three TFs (“TAF”, “Tead”, “YY1”) into gr_TFs object. Next step is to find overlapping peaks to ensure that we are comparing binding patterns over the same genomic regions.
# Step2: find peaks that are shared by all
ol <- findOverlapsOfPeaks(gr_TFs)
gr_TFs_ol <- ol$peaklist$`TAF///Tead4///YY1`
# Step3: read in example coverage files
# here we read in coverage data from -2000bp to -2000bp of each shared peak center
gr_TFs_ol_center <- reCenterPeaks(gr_TFs_ol, width = 4000) # use the center of the peaks and extend 2000bp upstream and downstream to obtain peaks with uniform length of 4000bp
bigWigs <- dir(extdata_path, "bigWig")
coverage_list <- sapply(file.path(extdata_path, bigWigs),
import, # rtracklayer::import
format = "BigWig",
which = gr_TFs_ol_center,
as = "RleList")
names(coverage_list) <- gsub(".bigWig", "", bigWigs)
names(coverage_list)
## [1] "TAF" "Tead4" "YY1"
# Step4: visualize binding patterns
sig <- featureAlignedSignal(coverage_list, gr_TFs_ol_center)
# since the bigWig files are only a subset of the original files,
# filter to keep peaks that are with coverage data for all peak sets
keep <- rowSums(sig[[1]]) > 0 & rowSums(sig[[2]]) > 0 & rowSums(sig[[3]]) > 0
sig <- sapply(sig, function(x) x[keep, ], simplify = FALSE)
gr_TFs_ol_center <- gr_TFs_ol_center[keep]
featureAlignedHeatmap(sig, gr_TFs_ol_center,
upper.extreme=c(3, 0.5, 4))
Figure 21: Heatmap of coverages for selected TFs
By default, the rows in the heatmap are ordered by the total coverage per row from the first sample (TAF in this example). We can reorder the rows by tuning the sortBy option, for example, setting sortBy = "YY1" will order the rows by the sample “YY1” in the dataset. You can also sort the rows by hierarchical clustering results, below is to demonstrate.
# perform hierarchical clustering on rows
sig_rowsums <- sapply(sig, rowSums, na.rm = TRUE)
row_distance <- dist(sig_rowsums)
hc <- hclust(row_distance)
# user hierarchical clustering order to sort
gr_TFs_ol_center$sort_by <- hc$order
featureAlignedHeatmap(sig, gr_TFs_ol_center,
upper.extreme = c(3, 0.5, 4),
sortBy = "sort_by")
Figure 22: Heatmap of coverages for selected TFs sorted by hierarchical clustering
In addition, we can generate density plot with featureAlignedDistribution function.
featureAlignedDistribution(sig, gr_TFs_ol_center,
type = "l")
Figure 23: Distribution of coverage densities for selected TFs
For experiments targeting a single TF with replicates, a common analytic strategy is outlined below.
BED/GFF files into GRangesCommon peak formats such as BED, GFF, and MACS can be converted into GRanges format using toGRanges function. After importing the data, concordance across peak replicates can be evaluated with findOverlapsOfPeaks and makeVennDiagram. In addition, the meta data columns will be dropped for the merged overlapping peaks. To add them back, we can leverage the addMetadata function. For example, addMetadata(ol, colNames = "score", FUN = mean) will add “score” column to each merged overlapping peak by taking the mean score of overlapping peaks involved.
library(ChIPpeakAnno)
# Convert BED/GFF into GRanges
bed1 <- system.file("extdata", "MACS_output_hg38.bed",
package = "ChIPpeakAnno")
gr1 <- toGRanges(bed1, format = "BED", header = FALSE)
gff1 <- system.file("extdata", "GFF_peaks_hg38.gff",
package = "ChIPpeakAnno")
gr2 <- toGRanges(gff1, format = "GFF", header = FALSE)
# Find overlapping peaks
ol <- findOverlapsOfPeaks(gr1, gr2)
# Add "score" meta column to overlapping peaks
ol <- addMetadata(ol, colNames = "score", FUN = mean)
head(ol$mergedPeaks, n = 2)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] chr1 778411-780198 * | gr1__MACS_peak_13,gr2__001,gr2__002
## [2] chr1 789471-791811 * | gr2__003,gr1__MACS_peak_14
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
venn <- makeVennDiagram(ol,
fill = c("#009E73", "#F0E442"),
col = c("#D55E00", "#0072B2"),
cat.col = c("#D55E00", "#0072B2"))
Figure 24: Venn diagram showing the number of overlapping peaks for gr1 and gr2
For P-value of hypergeometric test:
venn[["p.value"]]
## gr1 gr2 pval
## [1,] 1 1 0
For overlapping peak counts:
venn[["vennCounts"]]
## gr1 gr2 Counts count.gr1 count.gr2
## [1,] 0 0 0 0 0
## [2,] 0 1 61 0 61
## [3,] 1 0 63 63 0
## [4,] 1 1 165 167 168
## attr(,"class")
## [1] "VennCounts"
As can be seen, the extremely small P-value suggests that the two peak sets are highly relevant, reflecting a good consistency among experimental replicates.
As with the peak files, the annotation file must be converted into a GRanges object. Annotation GRanges can be constructed from not only BED, GFF, user defined txt files, but alsoEnsDb, TxDb objects with toGRanges function. For EnsDb and TxDb object, annotation can also be prepared with accessor functions, more see Section @ref(txdb_ensdb). Note that the version of genome used for creating the annotation file must match with the genome used for peak calling because the feature coordinates may vary across different genome releases. For example, if you are using Mus_musculus.v103 for mapping, you’d best use EnsDb.Mmusculus.v103 for annotation.
The following example converts feature genes from EnsDb to annotation data using accessor function.
library(EnsDb.Hsapiens.v86)
ensembl.hs86.gene <- toGRanges(EnsDb.Hsapiens.v86, feature = "gene")
head(ensembl.hs86.gene, n = 2)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 chr1 11869-14409 + | DDX11L1
## ENSG00000227232 chr1 14404-29570 - | WASH7P
## -------
## seqinfo: 357 sequences (1 circular) from 2 genomes (hg38, GRCh38)
Now given merged overlapping peaks and annotation data, we can visualize the distribution of the distance of merged overlapping peaks to the nearest features such as genes (TSSs) by binOverFeature function.
binOverFeature(ol$mergedPeaks,
nbins = 20,
annotationData = ensembl.hs86.gene,
xlab = "peak distance from TSSs (bp)",
ylab = "peak count",
main = "Distribution of aggregated peak numbers around TSS")
Figure 25: Peak count distribution around transcription start sites
We can use genomicElementDistribution to summarize the distribution of peaks over different types of genomic features such “exon”, “intron”, “enhancer”, “UTR”, etc. When inputting a single peak file, a pie graph will be created.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
genomicElementDistribution(ol$mergedPeaks,
TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene)
Figure 26: Pie graph showing peak distributions over different genomic features
As can be seen, a good amount of peaks come from promoter regions consistent with the signature of peaks obtained from transcription factor binding experiments. When inputting a list of peak sets (e.g. replicates), a bar graph will be created.
macs_peaks <- GRangesList(rep1 = gr1,
rep2 = gr2)
genomicElementDistribution(macs_peaks,
TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene)
Figure 27: Bar graph showing peak distributions over different genomic features for two replicates
The two peak replicates are consistent according to the bar graph.
To visualize peak overlapping for multiple feature sets, we can leverage UpSet plot (details see Section 5.3).
library(UpSetR)
res <- genomicElementUpSetR(ol$mergedPeak,
TxDb.Hsapiens.UCSC.hg38.knownGene)
upset(res[["plotData"]],
nsets = length(colnames(res$plotData)),
nintersects = NA)
Figure 28: UpSet plot showing peak overlappings for multiple features
According to the above distribution of aggregated peak numbers around TSS and the distribution of peaks in different chromosome regions, most of the peaks locate near TSS. Therefore, it is reasonable to adopt the feature-centric method to annotate the peaks residing in the promoter regions. Promoters can be specified with bindingRegion. In the following example, the promoter region is defined as 2K upstream and 500 downstream of TSS (bindingRegion = c(-2000, 500)).
ol_anno <- annotatePeakInBatch(ol$mergedPeak,
AnnotationData = ensembl.hs86.gene,
output = "nearestBiDirectionalPromoters",
bindingRegion = c(-2000, 500))
head(ol_anno, n = 2)
## GRanges object with 2 ranges and 9 metadata columns:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## X1 chr1 778411-780198 * | gr1__MACS_peak_13,gr2__001,gr2__002
## X1 chr1 778411-780198 * | gr1__MACS_peak_13,gr2__001,gr2__002
## peak feature feature.ranges feature.strand distance
## <character> <character> <IRanges> <Rle> <integer>
## X1 X1 ENSG00000228327 725885-778626 - 0
## X1 X1 ENSG00000237491 778770-810060 + 0
## insideFeature distanceToSite gene_name
## <character> <integer> <character>
## X1 overlapStart 0 RP11-206L10.2
## X1 overlapStart 0 RP11-206L10.9
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
You can save the annotation to a CSV file.
ol_anno <- unname(ol_anno) # remove names to avoid duplicate row.names error
ol_anno$peakNames <- NULL # remove peakNames to avoid unimplemented type 'list' error
write.csv(as.data.frame(ol_anno), "ol_anno.csv")
To visualize the distribution of peaks around features:
pie1(table(ol_anno$insideFeature))
Figure 29: Pie chart showing peak distribution around features
In addition to using the output = "nearestBiDirectionalPromoters" option, ChIPpeakAnno provides another helper function called peaksNearBDP to retrieve statistics for peaks located in bi-directional promoters.
peaks_near_BDP <- peaksNearBDP(ol$mergedPeaks,
AnnotationData = ensembl.hs86.gene,
MaxDistance = 5000)
# MaxDistance will be translated into "bindingRegion =
# c(-MaxDistance, MaxDistance)" internally
peaks_near_BDP$n.peaks
## [1] 165
peaks_near_BDP$n.peaksWithBDP
## [1] 18
peaks_near_BDP$percentPeaksWithBDP
## [1] 0.1090909
head(peaks_near_BDP$peaksWithBDP, n = 2)
## GRangesList object of length 2:
## $`1`
## GRanges object with 2 ranges and 10 metadata columns:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## X1 chr1 778411-780198 * | gr1__MACS_peak_13,gr2__001,gr2__002
## X1 chr1 778411-780198 * | gr1__MACS_peak_13,gr2__001,gr2__002
## bdp_idx peak feature feature.ranges feature.strand
## <integer> <character> <character> <IRanges> <Rle>
## X1 1 X1 ENSG00000228327 725885-778626 -
## X1 1 X1 ENSG00000237491 778770-810060 +
## distance insideFeature distanceToSite gene_name
## <integer> <character> <integer> <character>
## X1 0 overlapStart 0 RP11-206L10.2
## X1 0 overlapStart 0 RP11-206L10.9
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $`4`
## GRanges object with 2 ranges and 10 metadata columns:
## seqnames ranges strand | peakNames bdp_idx
## <Rle> <IRanges> <Rle> | <CharacterList> <integer>
## X4 chr1 920981-921619 * | gr1__MACS_peak_17,gr2__005 4
## X4 chr1 920981-921619 * | gr1__MACS_peak_17,gr2__005 4
## peak feature feature.ranges feature.strand distance
## <character> <character> <IRanges> <Rle> <integer>
## X4 X4 ENSG00000223764 916865-921016 - 0
## X4 X4 ENSG00000187634 923928-944581 + 2308
## insideFeature distanceToSite gene_name
## <character> <integer> <character>
## X4 overlapStart 0 RP11-54O7.3
## X4 upstream 2308 SAMD11
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Enhancers are DNA sequences that can increase the expression of genes and are commonly used as biomarkers for cancer diagnosis and treatment. They can be identified using various experimental methods such as 3C, 5C, and HiC (Lieberman-Aiden et al. 2009). With enhancers obtained through experimental approaches, we can find peaks that bind to the potential enhancer regions. The following example uses 5C data obtained with hg19 genome assembly, therefore, we need to use a matching annotation file.
library(EnsDb.Hsapiens.v75)
ensembl.hs75.gene <- toGRanges(EnsDb.Hsapiens.v75, feature = "gene")
DNA5C <- system.file("extdata",
"wgEncodeUmassDekker5CGm12878PkV2.bed.gz",
package="ChIPpeakAnno")
DNAinteractiveData <- toGRanges(gzfile(DNA5C))
# the example bed.gz file can also be downloaded from:
# https://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeUmassDekker5C/wgEncodeUmassDekker5CGm12878PkV2.bed.gz
peaks_near_enhancer <- findEnhancers(peaks = ol$mergedPeaks,
annoData = ensembl.hs75.gene,
DNAinteractiveData = DNAinteractiveData)
head(peaks_near_enhancer, n = 2)
## GRanges object with 1 range and 14 metadata columns:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## X1 chr1 151619224-151619324 * | gr2__229,gr1__MACS_peak_229
## feature feature.ranges feature.strand feature.shift.ranges
## <character> <IRanges> <Rle> <IRanges>
## X1 ENSG00000143393 151264273-151300191 - 151619414-151655333
## feature.shift.strand distance insideFeature distanceToSite gene_name
## <Rle> <integer> <character> <integer> <character>
## X1 + 89 upstream 89 PI4KB
## DNAinteractive.idx peak DNAinteractive.ranges DNAinteractive.blocks
## <integer> <character> <IRanges> <IRangesList>
## X1 814 X1 151283079-151636526 1-5699,335673-353448
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
With annotated peaks, we can use getEnrichedGO and getEnrichedPATH to obtain a list of enriched GO or pathway terms respectively.
library(org.Hs.eg.db)
enriched_go <- getEnrichedGO(annotatedPeak = ol_anno,
orgAnn = "org.Hs.eg.db",
feature_id_type = "ensembl_gene_id",
condense = TRUE)
enrichmentPlot(enriched_go)
Figure 30: Histogram showing enriched GO terms
Similarly, below is for pathway enrichment analysis.
library(reactome.db)
enriched_pathway <- getEnrichedPATH(annotatedPeak = ol_anno,
orgAnn = "org.Hs.eg.db",
pathAnn = "reactome.db",
maxP = 0.05)
enrichmentPlot(enriched_pathway)
Figure 31: Histogram showing enriched pathways
To perform motif analysis, we first need to extract sequences surrounding the peaks, then obtain the consensus sequences. We can also visualize the top motifs discovered.
library(BSgenome.Hsapiens.UCSC.hg38)
seq_around_peak <- getAllPeakSequence(ol$mergedPeaks,
upstream = 20,
downstream = 20,
genome = BSgenome.Hsapiens.UCSC.hg38)
head(seq_around_peak, n = 2)
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | peakNames
## <Rle> <IRanges> <Rle> | <CharacterList>
## [1] chr1 778411-780198 * | gr1__MACS_peak_13,gr2__001,gr2__002
## [2] chr1 789471-791811 * | gr2__003,gr1__MACS_peak_14
## upstream downstream sequence
## <numeric> <numeric> <character>
## [1] 20 20 TGCGCCATGTTGCTAGGCTG..
## [2] 20 20 AGAATTGAATTGAATGGACT..
## -------
## seqinfo: 1 sequence from an unspecified genome
We can use write2FASTA function to save the result to a fasta file. The following snippet obtains Z-scores for short oligos of length 6.
freqs <- oligoFrequency(BSgenome.Hsapiens.UCSC.hg38$chr1)
motif_summary <- oligoSummary(seq_around_peak,
oligoLength = 6,
MarkovOrder = 3,
freqs = freqs,
quickMotif = TRUE)
zscore <- sort(motif_summary$zscore)
h <- hist(zscore,
breaks = 100,
main = "Histogram of Z-score")
text(x = zscore[length(zscore)],
y = h$counts[length(h$counts)] + 1,
labels = names(zscore[length(zscore)]),
adj = 0,
srt = 90)
Figure 32: Histogram showing Z-score distribution for oligos of length 6
We can use motifStack to visualize the top motifs discovered.
library(motifStack)
pfm <- new("pfm", mat = motif_summary$motifs[[1]],
name = "sample motif 1")
motifStack(pfm)
Given two or more peak sets from different TFs, it might be interesting to see if the peak profiles are correlated, and if so, how does the peak patterns compare to each other. The workflow here shows how to compare binding profiles of three TFs. Steps are outlined below.
tf1 <- toGRanges(system.file("extdata/TAF.broadPeak", package = "ChIPpeakAnno"),
format = "broadPeak")
tf2 <- toGRanges(system.file("extdata/Tead4.broadPeak", package = "ChIPpeakAnno"),
format = "broadPeak")
tf3 <- toGRanges(system.file("extdata/YY1.broadPeak", package = "ChIPpeakAnno"),
format = "broadPeak")
To test the associations across different peak sets, ChIPpeakAnno implements both hypergeometric test (makeVennDiagram, details see Section @ref(hyper_test)) and permutation test (peakPermTest, details see Section @ref(perm_test)). For demonstration, here we use hypergeometric test.
library(BSgenome.Hsapiens.UCSC.hg38)
overlapping_peaks <- findOverlapsOfPeaks(tf1,
tf2,
tf3,
connectedPeaks = "keepAll")
mean_peak_width <- mean(width(unlist(GRangesList(overlapping_peaks[["all.peaks"]]))))
total_binding_sites <- length(BSgenome.Hsapiens.UCSC.hg38[["chr2"]]) * 0.03 / mean_peak_width
venn1 <- makeVennDiagram(overlapping_peaks,
totalTest = total_binding_sites,
connectedPeaks = "keepAll",
fill = c("#CC79A7", "#56B4E9", "#F0E442"),
col = c("#D55E00", "#0072B2", "#E69F00"),
cat.col = c("#D55E00", "#0072B2", "#E69F00"))
Figure 33: Venn diagram showing overlapping peaks for three TFs
For P-values of each peak pair:
venn1[["p.value"]]
## tf1 tf2 tf3 pval
## [1,] 0 1 1 4.565072e-52
## [2,] 1 0 1 0.000000e+00
## [3,] 1 1 0 2.744460e-88
Since P-values are all very small, each pair of the peak sets overlap significantly with each other.
We can leverage heatmap and density plot to visualize and compare the binding patterns of multiple TFs. For detailed instructions, refer to @ref(multi_exp).
ol_tfs <- findOverlapsOfPeaks(tf1, tf2, tf3,
connectedPeaks = "keepAll")
gr_ol_tfs <- ol_tfs$peaklist$`tf1///tf2///tf3`
TF_width <- width(gr_ol_tfs)
gr_ol_tfs_center <- reCenterPeaks(gr_ol_tfs, width = 4000)
extdata_path <- system.file("extdata", package = "ChIPpeakAnno")
bigWigs <- dir(extdata_path, "bigWig")
coverage_list <- sapply(file.path(extdata_path, bigWigs),
import, # rtracklayer::import
format = "BigWig",
which = gr_ol_tfs_center,
as = "RleList")
names(coverage_list) <- gsub(".bigWig", "", bigWigs)
sig <- featureAlignedSignal(coverage_list, gr_ol_tfs_center)
# since the bigWig files are only a subset of the original files,
# filter to keep peaks that are with coverage data for all peak sets
keep <- rowSums(sig[[1]]) > 0 & rowSums(sig[[2]]) > 0 & rowSums(sig[[3]]) > 0
sig <- sapply(sig, function(x) x[keep, ], simplify = FALSE)
gr_ol_tfs_center <- gr_ol_tfs_center[keep]
featureAlignedHeatmap(sig, gr_ol_tfs_center,
upper.extreme=c(3, 0.5, 4))
Figure 34: Heatmap of coverages for selected TFs
By default, the rows are ordered by the total coverage per row from the first sample. Below is an example to sort by the third sample (“YY1”) in the dataset by tuning the sortBy option.
featureAlignedHeatmap(sig, gr_ol_tfs_center,
upper.extreme=c(3, 0.5, 4),
sortBy = "YY1")
Figure 35: Heatmap of coverages for selected TFs sorted by YY1 sample
To create density plot, use featureAlignedDistribution function.
featureAlignedDistribution(sig, gr_ol_tfs_center,
type="l")
Figure 36: Distribution of coverage densities for selected TFs
As can be seen, while the binding of “YY1” is most significant, the binding of “Tead” is much weaker.
For usage related questions, please submit your questions on the Bioconductor Support Site. For bug report and novel feature request, please raise an issue on the ChIPpeakAnno GitHub repository.
ChIPpeakAnno implements a helper function toGRanges that is designed to import files. Users can use other tools such as rtracklayer::import as well .
ChIPpeakAnno provides multiple options when annotating peaks, which can be specified with the output argument in the annotatePeakInBatch function.
output = "nearestLocation", output = "shortestDistance", or output = "overlapping". This is a peak-centric method, details see Section @ref(peak_centric).output = "upstream". If you would like more flexibility, you can set the bindingRegion option. This is a feature-centric method, details see Section @ref(feature_centric)annotatePeakInBatch or annoPeaks?You should use annotatePeakInBatch because it is the master wrapper function that would invoke annoPeaks internally. Historically, annotatePeakInBatch mainly consisted of the peak-centric methods at the beginning, feature-centric methods were implemented in annoPeaks function and were eventually incorporated into annotatePeakInBatch.
annotatePeakInBatch?The annotation data must be in GRanges format for annotatePeakInBatch to work. You can convert TxDb, EnsDb, or even GFF and BED, etc. into GRanges with toGRanges. You can also leverage the getAnnotation function to retrieve annotation data from “BioMart”. Moreover, you can even create your own customized annotation file. Detailed instructions see Section ??.
This question is in response to this post.
When you set output = "both", both the “nearest” and “overlapping” features will be output. If you would like to assign up to one type of feature to each peak (either “nearest” or “overlapping” but not both, and prefer “overlapping” if the “nearest” feature is not “overlapping”), you can use the following strategy: first, annotate peaks to the overlapping features; second, annotate the peaks that don’t have overlapping features to the nearest features; last, concatenate the two. Below is some example codes.
library(ensembldb)
library(EnsDb.Hsapiens.v75)
data(myPeakList)
annoData <- annoGR(EnsDb.Hsapiens.v75)
# Step1: annotate peaks to the overlapping features, if "select = 'all'", multiple features can be assigned to a single peak.
anno_overlapping <- annotatePeakInBatch(myPeakList,
AnnotationData = annoData,
output = "overlapping",
select = "first")
anno_overlapping_non_na <- anno_overlapping[!is.na(anno_overlapping$feature)]
# Step2: annotate peaks that are without overlapping features to nearest features
myPeakList_non_overlapping <- myPeakList[!(names(myPeakList) %in% anno_overlapping_non_na$peak)]
anno_nearest <- annotatePeakInBatch(myPeakList_non_overlapping,
AnnotationData = annoData,
output = "nearestLocation",
select = "first")
# Step3: concatenate the two
anno_final <- c(anno_overlapping_non_na, anno_nearest)
The above code assigns either the “overlapping” or “nearest” feature to each peak, and if “overlapping” feature is not the “nearest”, only the “overlapping” one will be reported.
This question is very typical for calculating intersection of peaks. This is because peak is a range of continuous points instead of a single point. Consider intersecting two lists of range objects: list A (1~3, 4~5, 7~9) and list B (2~8), how many ranges should we output as the overlapping list? It would be one if we use list B as reference and three if we use list A as reference.
findOverlapsOfPeaks count the number of overlapping peaks?This question is in response to this post.
When counting the number of overlapping peaks using findOverlapsOfPeaks, there is a connectedPeaks option. If multiple peaks are involved in any group of connected/overlapping peaks in any input peak list, set connectedPeaks = "merge" will add one to the overlapping counts, while set connectedPeaks = "min" will add the minimal involved peaks in each group of connected/overlapped peaks to the overlapping counts. Set connectedPeaks = "keepAll" will add the number of involved peaks for each peak list to the corresponding overlapping counts. In addition, it will output counts as if connectedPeaks was set to “min”.
For examples, if 5 peaks in group1 overlap with 2 peaks in group 2, setting connectedPeaks = "merge" will add 1 to the overlapping counts; setting connectedPeaks = "min" will add 2 to the overlapping counts; setting connectedPeaks ="keepAll" will add 5 peaks to count.group1, 2 to count.group2, and 2 to counts.
You can set connectedPeaks = "keepAll" in the findOverlapsOfPeaks and makeVennDiagram function.
In the output of findOverlapsOfPeaks, there is a column in the metadata of each element in peaklist, called peakNames, which is a CharacterList. The CharacterList is a list of the contributing peak ids with prefix, e.g. peaks1_peakname1, peaksi_peaknamej where i refers to the peak group i where j refers to the peak j in that peak group. Users can access the original peak name by splitting those strings. Here is the sample code:
library(ChIPpeakAnno)
library(reshape2)
# Step1: read in two peak files
bed <- system.file("extdata",
"MACS_output.bed",
package = "ChIPpeakAnno")
gff <- system.file("extdata",
"GFF_peaks.gff",
package = "ChIPpeakAnno")
gr1 <- toGRanges(bed, format = "BED",
header = FALSE)
gr2 <- toGRanges(gff, format = "GFF",
header = FALSE, skip = 3)
names(gr2) <- seq(length(gr2)) # add names to gr2 for Step4
# Step2: find overlapping peaks
ol <- findOverlapsOfPeaks(gr1, gr2)
peakNames <- ol$peaklist[['gr1///gr2']]$peakNames
# Step3: extract original peak names
peakNames <- melt(peakNames,
value.name = "merged_peak_id") # reshape df
head(peakNames, n = 2)
## merged_peak_id NA NA
## 1 1 1 gr1__MACS_peak_13
## 2 1 1 gr2__1
peakNames <- cbind(peakNames[, 1],
do.call(rbind,
strsplit(as.character(peakNames[, 3]), "__")))
colnames(peakNames) <- c("merged_peak_id", "group", "peakName")
head(peakNames, n = 2)
## merged_peak_id group peakName
## [1,] "1" "gr1" "MACS_peak_13"
## [2,] "1" "gr2" "1"
# Step4: split by peak group
gr1_subset <- gr1[peakNames[peakNames[, "group"] %in% "gr1", "peakName"]]
gr2_subset <- gr2[peakNames[peakNames[, "group"] %in% "gr2", "peakName"]]
head(gr1_subset, n = 2)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## MACS_peak_13 chr1 713791-715332 * | 2550.61
## MACS_peak_14 chr1 724851-727191 * | 58.34
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(gr2_subset, n = 2)
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## 1 chr1 713893-714747 + | bed2gff region_0 0 <NA>
## 2 chr1 715023-715578 + | bed2gff region_1 0 <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Alternatively, the following snippet should also work.
gr1_renamed <- ol$all.peaks$gr1
gr2_renamed <- ol$all.peaks$gr2
head(gr1_renamed, n = 2)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## gr1__MACS_peak_1 chr1 28341-29610 * | 160.81
## gr1__MACS_peak_2 chr1 90821-91234 * | 133.12
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(gr2_renamed, n = 2)
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## gr2__1 chr1 713893-714747 + | bed2gff region_0 0 <NA>
## gr2__2 chr1 715023-715578 + | bed2gff region_1 0 <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
peakNames <- melt(ol$peaklist[['gr1///gr2']]$peakNames,
value.name = "merged_peak_id")
gr1_subset <- gr1_renamed[peakNames[grepl("^gr1", peakNames[, 3]), 3]]
gr2_subset <- gr2_renamed[peakNames[grepl("^gr2", peakNames[, 3]), 3]]
head(gr1_subset, n = 2)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## gr1__MACS_peak_13 chr1 713791-715332 * | 2550.61
## gr1__MACS_peak_14 chr1 724851-727191 * | 58.34
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(gr2_subset, n = 2)
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | source type score phase
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
## gr2__1 chr1 713893-714747 + | bed2gff region_0 0 <NA>
## gr2__2 chr1 715023-715578 + | bed2gff region_1 0 <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
totalTest when using makeVennDiagram?When we test the association between two sets of peak data based on hypergeometric distribution, the number of all potential binding sites is required. The parameter totalTest in the function makeVennDiagram specifies how many potential peaks in total will be used in the hypergeometric test. It should be larger than the largest number of peaks in the peak list. The smaller it is set, the more stringent the test is (larger P-values). The time used to calculate P-value does not depend on the value of the totalTest. For practical guidance on how to choose totalTest, please refer to this post.
If you use ChIPpeakAnno in your work, please cite it as follows:
## Please cite the paper below for the ChIPpeakAnno package.
##
## Lihua J Zhu, Claude Gazin, Nathan D Lawson, Herve Pages, Simon M Lin,
## David S Lapointe and Michael R Green, ChIPpeakAnno: a Bioconductor
## package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics.
## 2010, 11:237
##
## Zhu LJ. Integrative analysis of ChIP-chip and ChIP-seq dataset.
## Methods Mol Biol. 2013;1067:105-24.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
Here is the output of sessionInfo() on the system on which this document was
compiled running pandoc 3.1.1:
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] reshape2_1.4.4
## [2] EnsDb.Hsapiens.v75_2.99.0
## [3] motifStack_1.46.0
## [4] BSgenome.Hsapiens.UCSC.hg38_1.4.5
## [5] BSgenome_1.70.1
## [6] rtracklayer_1.62.0
## [7] BiocIO_1.12.0
## [8] Biostrings_2.70.1
## [9] XVector_0.42.0
## [10] reactome.db_1.86.0
## [11] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2
## [12] UpSetR_1.4.0
## [13] biomaRt_2.58.0
## [14] AnnotationHub_3.10.0
## [15] BiocFileCache_2.10.1
## [16] dbplyr_2.4.0
## [17] knitr_1.45
## [18] org.Hs.eg.db_3.18.0
## [19] TxDb.Hsapiens.UCSC.hg38.knownGene_3.18.0
## [20] EnsDb.Hsapiens.v86_2.99.0
## [21] ensembldb_2.26.0
## [22] AnnotationFilter_1.26.0
## [23] GenomicFeatures_1.54.1
## [24] AnnotationDbi_1.64.1
## [25] Biobase_2.62.0
## [26] ChIPpeakAnno_3.35.3
## [27] testthat_3.2.0
## [28] GenomicRanges_1.54.1
## [29] GenomeInfoDb_1.38.1
## [30] IRanges_2.36.0
## [31] S4Vectors_0.40.1
## [32] BiocGenerics_0.48.1
## [33] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.1 later_1.3.1
## [3] bitops_1.0-7 filelock_1.0.2
## [5] R.oo_1.25.0 tibble_3.2.1
## [7] graph_1.80.0 DirichletMultinomial_1.44.0
## [9] XML_3.99-0.14 lifecycle_1.0.3
## [11] rprojroot_2.0.3 processx_3.8.2
## [13] lattice_0.22-5 MASS_7.3-60
## [15] magrittr_2.0.3 sass_0.4.7
## [17] rmarkdown_2.25 jquerylib_0.1.4
## [19] yaml_2.3.7 remotes_2.4.2.1
## [21] httpuv_1.6.12 grImport2_0.3-1
## [23] sessioninfo_1.2.2 pkgbuild_1.4.2
## [25] CNEr_1.38.0 DBI_1.1.3
## [27] ade4_1.7-22 abind_1.4-5
## [29] pkgload_1.3.3 zlibbioc_1.48.0
## [31] R.utils_2.12.2 purrr_1.0.2
## [33] RCurl_1.98-1.12 pracma_2.4.2
## [35] rappdirs_0.3.3 GenomeInfoDbData_1.2.11
## [37] seqLogo_1.68.0 annotate_1.80.0
## [39] codetools_0.2-19 DelayedArray_0.28.0
## [41] xml2_1.3.5 tidyselect_1.2.0
## [43] futile.logger_1.4.3 farver_2.1.1
## [45] base64enc_0.1-3 matrixStats_1.0.0
## [47] GenomicAlignments_1.38.0 jsonlite_1.8.7
## [49] multtest_2.58.0 ellipsis_0.3.2
## [51] survival_3.5-7 tools_4.3.1
## [53] progress_1.2.2 TFMPvalue_0.0.9
## [55] Rcpp_1.0.11 glue_1.6.2
## [57] gridExtra_2.3 SparseArray_1.2.2
## [59] xfun_0.40 MatrixGenerics_1.14.0
## [61] usethis_2.2.2 dplyr_1.1.3
## [63] withr_2.5.2 formatR_1.14
## [65] BiocManager_1.30.22 fastmap_1.1.1
## [67] fansi_1.0.5 caTools_1.18.2
## [69] callr_3.7.3 digest_0.6.33
## [71] R6_2.5.1 mime_0.12
## [73] colorspace_2.1-0 GO.db_3.18.0
## [75] poweRlaw_0.70.6 gtools_3.9.4
## [77] jpeg_0.1-10 RSQLite_2.3.2
## [79] R.methodsS3_1.8.2 utf8_1.2.4
## [81] generics_0.1.3 prettyunits_1.2.0
## [83] InteractionSet_1.30.0 httr_1.4.7
## [85] htmlwidgets_1.6.2 S4Arrays_1.2.0
## [87] TFBSTools_1.40.0 regioneR_1.34.0
## [89] pkgconfig_2.0.3 gtable_0.3.4
## [91] blob_1.2.4 brio_1.1.3
## [93] htmltools_0.5.7 profvis_0.3.8
## [95] bookdown_0.36 RBGL_1.78.0
## [97] ProtGenerics_1.34.0 scales_1.2.1
## [99] png_0.1-8 lambda.r_1.2.4
## [101] rstudioapi_0.15.0 tzdb_0.4.0
## [103] rjson_0.2.21 curl_5.1.0
## [105] cachem_1.0.8 stringr_1.5.0
## [107] BiocVersion_3.18.0 parallel_4.3.1
## [109] miniUI_0.1.1.1 restfulr_0.0.15
## [111] desc_1.4.2 pillar_1.9.0
## [113] vctrs_0.6.4 urlchecker_1.0.1
## [115] promises_1.2.1 xtable_1.8-4
## [117] evaluate_0.23 readr_2.1.4
## [119] VennDiagram_1.7.3 cli_3.6.1
## [121] compiler_4.3.1 futile.options_1.0.1
## [123] Rsamtools_2.18.0 rlang_1.1.1
## [125] crayon_1.5.2 labeling_0.4.3
## [127] ps_1.7.5 plyr_1.8.9
## [129] fs_1.6.3 stringi_1.7.12
## [131] BiocParallel_1.36.0 munsell_0.5.0
## [133] lazyeval_0.2.2 devtools_2.4.5
## [135] Matrix_1.6-1.1 hms_1.1.3
## [137] bit64_4.0.5 ggplot2_3.4.4
## [139] KEGGREST_1.42.0 seqinr_4.2-30
## [141] shiny_1.7.5.1 SummarizedExperiment_1.32.0
## [143] interactiveDisplayBase_1.40.0 highr_0.10
## [145] memoise_2.0.1 bslib_0.5.1
## [147] bit_4.0.5